I'm trying to use a positive lookbehind in a Python regex to match device names and serial numbers in this sample zpool output. I think I'm misunderstanding something about the lookbehind syntax, because I can't match the serial numbers.
I'm using the Patterns app on my desktop to sandbox this. I have read several other Stack Overflow questions about lookbehind assertions, but what I can find only suggests I'm on the right track, and nothing I've seen so far has made it clear what I'm getting wrong.
pool ONLINE 0 0 0
raidz2-0 ONLINE 0 0 0
diskid/DISK-PK2331PAG6ZLMT ONLINE 0 0 0
da21 ONLINE 0 0 0
diskid/DISK-PK2331PAG6ZVMT ONLINE 0 0 0
diskid/DISK-PK2331PAG728ET ONLINE 0 0 0
diskid/DISK-PK2331PAG6YGXT ONLINE 0 0 0
I want to grab the device or serial number in the first group, and its status (ONLINE|AVAIL) in the second group. The regex I'm using is:
^\s+(da\d+|(?<=diskid/DISK-)\S+)\s+(ONLINE|AVAIL)\s
It's matching the device name da21 and its status, but it's not seeing the devices named by serial number. What am I missing about this syntax?
Let's look at a single line to see what your regex is matching:
# your regex
^\s+(da\d+|(?<=diskid/DISK-)\S+)\s+(ONLINE|AVAIL)\s
# your string
    diskid/DISK-PK2331PAG6ZLMT ONLINE 0 0 0
<       # ^ assert position at start of string
^^^^    # \s+ match one or more whitespace characters
    ^!  # da\d+ matches d, fails to match a, backtrack; try next alternation
<<<<<!  # (?<=diskid/DISK-) assert what precedes matches the lookbehind
        # This fails because the text to the left of the parser's current position does
        # not match diskid/DISK- (it's the four spaces previously matched by \s+)
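The failure described above can be reproduced with Python's built-in re module. This is only a minimal sketch: the sample text is abbreviated from the question, and the four-space indentation is an assumption, since the original spacing of the zpool output isn't preserved here. Because the lookbehind is tested immediately after \s+, the text to its left always ends in whitespace and can never end in diskid/DISK-.

```python
import re

# Abbreviated sample from the question; the four-space indent is assumed.
SAMPLE = (
    "    diskid/DISK-PK2331PAG6ZLMT ONLINE 0 0 0\n"
    "    da21 ONLINE 0 0 0\n"
)

# The pattern from the question. MULTILINE lets ^ match at the start of each line.
original = re.compile(
    r"^\s+(da\d+|(?<=diskid/DISK-)\S+)\s+(ONLINE|AVAIL)\s",
    re.MULTILINE,
)

# Only the da21 line matches; the serial-number line is skipped.
matches = original.findall(SAMPLE)
print(matches)
```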
There are multiple regex patterns that may satisfy what you're trying to accomplish:
Option 1: This captures \S+ if it's preceded by diskid/DISK-, or da\d+, into capture group 1, then captures ONLINE or AVAIL into capture group 2.
((?<=diskid/DISK-)\S+|da\d+)\s+(ONLINE|AVAIL)\b
Pro: One capture group for the identifier, whether it's a serial number or a device name
Con: It can't ensure that the first capture group is at the start of the line
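As a rough illustration of this first pattern and its trade-off, here is a sketch using the built-in re module. The sample lines are abbreviated from the question with an assumed four-space indent, and the "see da7 AVAIL later" string is a made-up input used only to show the lack of anchoring.

```python
import re

# Abbreviated sample from the question; the four-space indent is assumed.
SAMPLE = (
    "    raidz2-0 ONLINE 0 0 0\n"
    "    diskid/DISK-PK2331PAG6ZLMT ONLINE 0 0 0\n"
    "    da21 ONLINE 0 0 0\n"
)

# Lookbehind moved to the front of the alternation, with no ^\s+ prefix.
option1 = re.compile(r"((?<=diskid/DISK-)\S+|da\d+)\s+(ONLINE|AVAIL)\b")

# Group 1 holds the serial number or device name, group 2 the status.
pairs = option1.findall(SAMPLE)
print(pairs)

# The con in practice: with no anchor, a match can start mid-line.
# (Hypothetical input, not taken from real zpool output.)
stray = option1.search("see da7 AVAIL later")
print(stray.group(1), stray.group(2))
```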
Option 2: This captures \S+ into capture group 1 if it's preceded by diskid/DISK-, or da\d+ into capture group 2, then captures ONLINE or AVAIL into capture group 3.
^\s+(?:diskid/DISK-(\S+)|(da\d+))\s+(ONLINE|AVAIL)\b
Pro: Anchored to the start of the line (^\s+), so we can ensure that's where the data we're trying to match is
Con: Two capture groups for the identifier (with the built-in re module we can't capture two different sets of data, each with its own required prefix, into one capture group), so the value lands in group 1 or group 2 depending on which branch matched
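One way to cope with the two identifier groups is to take whichever one participated in the match. The sketch below assumes the same abbreviated sample with a four-space indent. The `m.group(1) or m.group(2)` idiom is a suggestion rather than part of the pattern itself.

```python
import re

# Abbreviated sample from the question; the four-space indent is assumed.
SAMPLE = (
    "    raidz2-0 ONLINE 0 0 0\n"
    "    diskid/DISK-PK2331PAG6ZLMT ONLINE 0 0 0\n"
    "    da21 ONLINE 0 0 0\n"
)

# Anchored pattern: group 1 = serial number, group 2 = device name, group 3 = status.
option2 = re.compile(
    r"^\s+(?:diskid/DISK-(\S+)|(da\d+))\s+(ONLINE|AVAIL)\b",
    re.MULTILINE,
)

# Only one of groups 1 and 2 is set per match; the other is None.
devices = [
    (m.group(1) or m.group(2), m.group(3))
    for m in option2.finditer(SAMPLE)
]
print(devices)
```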
Option 3, using the regex library: With the PyPI regex library we can accomplish this quite easily, getting a single capture group for the identifier while still asserting its position in the string.
Branch reset method (the alternation yields a single capture group instead of two):
^\s+(?|diskid/DISK-(\S+)|(da\d+))\s+(ONLINE|AVAIL)\b
    ^^^ # same as option 2, but uses branch reset
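A sketch of the branch reset pattern in use follows. It assumes the third-party regex package is installed (pip install regex), since the built-in re module does not accept (?|...). The sample text, its four-space indent, and the helper name parse_devices are illustrative assumptions.

```python
try:
    import regex  # third-party package: pip install regex
except ImportError:  # keep the sketch importable when the package is absent
    regex = None

# Abbreviated sample from the question; the four-space indent is assumed.
SAMPLE = (
    "    raidz2-0 ONLINE 0 0 0\n"
    "    diskid/DISK-PK2331PAG6ZLMT ONLINE 0 0 0\n"
    "    da21 ONLINE 0 0 0\n"
)

# (?|...) is a branch reset group: both alternatives number their capture as group 1.
PATTERN = r"^\s+(?|diskid/DISK-(\S+)|(da\d+))\s+(ONLINE|AVAIL)\b"


def parse_devices(text):
    """Return (identifier, status) pairs; hypothetical helper for illustration."""
    if regex is None:
        raise RuntimeError("the third-party 'regex' package is required")
    return regex.findall(PATTERN, text, flags=regex.MULTILINE)


if regex is not None:
    print(parse_devices(SAMPLE))
```

With branch reset, the identifier is always in group 1 and the status in group 2, so no `or` fallback between groups is needed.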