I need to write a bit of regex that is a bit over my head. The goal here is to parse the following type of log lines inside a logstash filter:
severity=I time=2017-02-23T10:04:31Z [SKYLIGHT] [0.5.1] Unable to start
severity=I time=2017-02-23T10:04:31Z adapter=redis adapter_host=1.1.1.1 Cache read: /model/reference/6235290d29a17a935f4d3d72d2e0a903750dd54b
severity=I time=2017-02-23T10:04:31Z remote_ip=1.1.1.1 uuid=daa8090d method=GET path=/somepath.json format=json controller=app action=index status=200 duration=30.47 view=10.04
severity=D time=2017-02-23T10:04:31Z remote_ip=1.1.1.1 uuid=daa8090d SOLR Request (18.3ms) [path=/admin/luke parameters={numTerms: 0}]
Essentially the output format is a set of arbitrary k=v pairs, sometimes followed by a "raw message". Using the logstash kv filter directly produces undesired behavior, since the trailing message can have k=v formats nested inside of it, such as path=/admin/luke in the final log line above. My working plan is to split each log line into two parts: the k/v pairs as a string, and the trailing message. The k/v string could then be sent into the normal logstash kv filter. So, for instance, the final log line would produce two groups:
severity=D time=2017-02-23T10:04:31Z remote_ip=1.1.1.1 uuid=daa8090d
SOLR Request (18.3ms) [path=/admin/luke parameters={numTerms: 0}]
The end goal is for the log documents to be:
[
{
"severity": "I",
"time": "2017-02-23T10:04:31Z",
"message": "[SKYLIGHT] [0.5.1] Unable to start"
},
{
"severity": "I",
"time": "2017-02-23T10:04:31Z",
"adapter": "redis",
"adapter_host": "1.1.1.1",
"message": "Cache read: /model/reference/6235290d29a17a935f4d3d72d2e0a903750dd54b"
},
{
"severity": "I",
"time": "2017-02-23T10:04:31Z",
"remote_ip": "1.1.1.1",
"uuid": "daa8090d",
"method": "GET",
"path": "/somepath.json",
"format": "json",
"controller": "app",
"action": "index",
"status": "200",
"duration": "30.47",
"view": "10.04"
},
{
"severity": "D",
"time": "2017-02-23T10:04:31Z",
"remote_ip": "1.1.1.1",
"uuid": "daa8090d",
"message": "SOLR Request (18.3ms) [path=/admin/luke parameters={numTerms: 0}]"
}
]
Thank you!
For each row use the following regular expression:
(?:([^ =]+)=([^ =]+) ?)|(.+)
Explanation:
`(?:` - "External", non-capturing group (`xxxx=yyyy`).
`([^ =]+)` - First capturing group (`xxxx`).
`=` - Equals sign (between `xxxx` and `yyyy`).
`([^ =]+)` - Second capturing group (`yyyy`).
` ?` - A space (may occur).
`)` - End of the "external" group.
`|` - Separator between variants.
`(.+)` - Second variant - third capturing group, any non-empty sequence of chars.
Note that the regex processor initially tries the 1st variant (before the `|`), capturing `xxxx=yyyy` pairs. Then, if the 1st variant fails (after all `xxxx=yyyy` pairs), the 2nd variant is tried, capturing the message (if any).
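To make the breakdown above easier to check, here is a minimal Python sketch that writes the same pattern in verbose mode with one comment per piece, and compares it against the compact form. The names `COMPACT` and `ANNOTATED` are made up for illustration, and Python's `re` module is only an assumption here; it is not the engine Logstash uses.

```python
import re

# Compact form, exactly as given above.
COMPACT = re.compile(r"(?:([^ =]+)=([^ =]+) ?)|(.+)")

# The same pattern in verbose mode. Whitespace outside character classes is
# ignored under re.VERBOSE, so the optional space is written as "[ ]?".
ANNOTATED = re.compile(
    r"""
    (?:             # "external", non-capturing group (xxxx=yyyy)
        ([^ =]+)    # group 1: the key (xxxx)
        =           # equals sign between key and value
        ([^ =]+)    # group 2: the value (yyyy)
        [ ]?        # an optional trailing space
    )
    |               # separator between the two variants
    (.+)            # group 3: the rest of the line (the message)
    """,
    re.VERBOSE,
)

line = "severity=I time=2017-02-23T10:04:31Z [SKYLIGHT] [0.5.1] Unable to start"
compact_groups = [m.groups() for m in COMPACT.finditer(line)]
annotated_groups = [m.groups() for m in ANNOTATED.finditer(line)]
```

Both lists should contain three matches for this line: two k=v pairs, then one match where only group 3 is set.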
I tried this regex using an online verifier (regex101.com) on each of your input rows.
E.g. for the last row
(severity=D time=2017-02-23T10:04:31Z remote_ip=1.1.1.1 uuid=daa8090d SOLR Request (18.3ms) [path=/admin/luke parameters={numTerms: 0})
I got the following results:
Match 1
Full match 0-11 `severity=D `
Group 1. 0-8 `severity`
Group 2. 9-10 `D`
Match 2
Full match 11-37 `time=2017-02-23T10:04:31Z `
Group 1. 11-15 `time`
Group 2. 16-36 `2017-02-23T10:04:31Z`
Match 3
Full match 37-55 `remote_ip=1.1.1.1 `
Group 1. 37-46 `remote_ip`
Group 2. 47-54 `1.1.1.1`
Match 4
Full match 55-69 `uuid=daa8090d `
Group 1. 55-59 `uuid`
Group 2. 60-68 `daa8090d`
Match 5
Full match 69-133 `SOLR Request (18.3ms) [path=/admin/luke parameters={numTerms: 0}`
Group 3. 69-133 `SOLR Request (18.3ms) [path=/admin/luke parameters={numTerms: 0}`
Note that for matches 1 to 4, groups 1 and 2 were found.
For the last match, only group 3 was found.
So, processing each match, you have to check:
If group 1 is not empty, then group 2 is also not empty, and they contain k and v.
Otherwise, group 3 holds the content of the message.
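The per-match check described above could look roughly like this in Python. This is only a sketch of the logic, not Logstash configuration: `parse_line` and `KV_OR_MESSAGE` are hypothetical names, and storing the trailing text under a `"message"` key is an assumption taken from the desired output in the question. Note that in Python an unmatched group is `None` rather than an empty string.

```python
import re

KV_OR_MESSAGE = re.compile(r"(?:([^ =]+)=([^ =]+) ?)|(.+)")

def parse_line(line):
    """Collect k=v pairs into a dict; store any trailing text under "message"."""
    fields = {}
    for match in KV_OR_MESSAGE.finditer(line):
        key, value, message = match.groups()
        if key is not None:
            # 1st variant matched: groups 1 and 2 hold k and v.
            fields[key] = value
        else:
            # 2nd variant matched: group 3 holds the rest of the line.
            fields["message"] = message
    return fields

sample = (
    "severity=D time=2017-02-23T10:04:31Z remote_ip=1.1.1.1 uuid=daa8090d "
    "SOLR Request (18.3ms) [path=/admin/luke parameters={numTerms: 0}]"
)
parsed = parse_line(sample)
```

Because `(.+)` consumes everything up to the end of the line, the `path=/admin/luke` inside the message is not picked up as a separate pair.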