I'm trying to capture some data from logs in an application. The logs look like so:
*junk* [{count=240.0, state=STATE1}, {count=1.0, state=STATE2}, {count=93.0, state=STATE3}, {count=1.0, state=STATE4}, {count=1147.0, state=STATE5}, etc. ] *junk*
If the count for a particular state is ever 0, it won't appear in the log at all, so I can't rely on the position of any object in the log. (The only ordering guarantee is that the objects are sorted alphabetically by state name.)
So, this is also a potential log:
*junk* [{count=240.0, state=STATE1}, {count=1.0, state=STATE4}, {count=1147.0, state=STATE5}, etc. ] *junk*
I'm somewhat new to regular expressions, and I think I'm overdoing it, but this is what I've tried:
^[^=\n]*=(?:(?P<STATE1>\d+)(?=\.0,\s+\w+=STATE1))*.*?=(?P<STATE2>\d+)(?=\.0,\s+\w+=STATE2)*.*?=(?P<STATE3>\d+)(?=\.0,\s+\w+=STATE3)
The idea is that I'll look for the '=' and then look ahead to see whether this count belongs to the state I want, which may or may not be there. Then I skip all the junk after the count until the next state that I'm interested in (this is the part I believe I'm having issues with). Sometimes it matches too far and skips the state I'm interested in, giving me a bad value. If I use the lazy operator (as above), sometimes it doesn't go far enough and captures the count for a state that comes before the one I want in the log.
After some experimentation, this is what I've come up with:
The answers provided here, although good answers, don't quite work if your state names don't end with a number (mine don't, I just changed them to make the question easier to read and to remove business information from the question).
Here's a completely tile-able regex where you can add on as many matches as needed:
count=(?P<GROUP_NAME_HERE>\d+(?=\.0, state=STATE_NAME_HERE))?
The optional group after count= can be copied and appended once per state, each copy with its own group name and state name. Additionally, if any of the states do not appear in the string, it will still match the following states. For example:
count=(?P<G1>\d+(?=\.0, state=STATE_ONE))?(?P<G2>\d+(?=\.0, state=STATE_TWO))?(?P<G3>\d+(?=\.0, state=STATE_THREE))?
will match states STATE_ONE and STATE_THREE with named groups G1 & G3 in the following string even though STATE_TWO is missing:
[{count=55.0, state=STATE_ONE}, {count=10.0, state=STATE_THREE}]
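As a rough check of the example above, here is a minimal Python sketch. It assumes the pattern is applied repeatedly across the line (the equivalent of regex101's global flag), so each count= occurrence produces its own match and fills whichever group's lookahead succeeds. The helper name extract_counts is made up for illustration.

```python
import re

# Tiled pattern from the example above: one "count=" followed by one
# optional named group per state of interest.
PATTERN = re.compile(
    r"count="
    r"(?P<G1>\d+(?=\.0, state=STATE_ONE))?"
    r"(?P<G2>\d+(?=\.0, state=STATE_TWO))?"
    r"(?P<G3>\d+(?=\.0, state=STATE_THREE))?"
)

def extract_counts(line):
    """Hypothetical helper: collect every named group that captured a value."""
    counts = {}
    # finditer yields one match per "count=" occurrence in the line.
    for match in PATTERN.finditer(line):
        for name, value in match.groupdict().items():
            if value is not None:  # groups that did not participate are None
                counts[name] = int(value)
    return counts

line = "[{count=55.0, state=STATE_ONE}, {count=10.0, state=STATE_THREE}]"
print(extract_counts(line))  # {'G1': 55, 'G3': 10}
```

G2 never shows up in the result because no count= in the line is followed by STATE_TWO.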
I'm sure this could be improved, but it's fast enough for me, and with 11 groups, regex101 shows 803 steps with a time of ~1ms.
Here's a regex101 playground to mess with: https://regex101.com/r/3a3iQf/1
Notice how groups 1, 2, 3, 4, 5, 6, 7, 9, & 11 match. Groups 8 & 10 are missing, and the groups that follow them still match.
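Since the tiling is mechanical, the pattern could also be generated rather than copied by hand. The following is only a sketch under a few assumptions: build_pattern, parse_counts, and the GROUPS mapping are hypothetical names, group names must be valid Python identifiers, and states missing from the line are reported as 0 because the log omits zero counts.

```python
import re

def build_pattern(groups):
    """Hypothetical helper: tile one optional group per {group_name: state_name}."""
    tiles = "".join(
        rf"(?P<{group}>\d+(?=\.0, state={re.escape(state)}))?"
        for group, state in groups.items()
    )
    return re.compile("count=" + tiles)

def parse_counts(line, groups):
    """Return {group_name: count}; states absent from the line stay at 0."""
    pattern = build_pattern(groups)
    counts = {group: 0 for group in groups}  # the log drops zero counts
    for match in pattern.finditer(line):
        for group, value in match.groupdict().items():
            if value is not None:
                counts[group] = int(value)
    return counts

# Illustrative mapping, mirroring the three-state example.
GROUPS = {"G1": "STATE_ONE", "G2": "STATE_TWO", "G3": "STATE_THREE"}
line = "[{count=55.0, state=STATE_ONE}, {count=10.0, state=STATE_THREE}]"
print(parse_counts(line, GROUPS))  # {'G1': 55, 'G2': 0, 'G3': 10}
```

One thing to watch: the lookahead only checks that the text after state= starts with the given name, so a state whose name is a prefix of another state's name would match both unless the closing } is added to the lookahead.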