Positive lookbehind regex not matching as expected


I'm trying to use a positive lookbehind in a Python regex to match device names and serial numbers in this sample zpool output. I think I'm misunderstanding something about the lookbehind syntax, because I'm not able to match the serial numbers.

I'm using the Patterns app on my desktop to sandbox this. I have read several other Stack Overflow questions about lookbehind assertions, but what I can find only suggests I'm on the right track, and nothing I've seen so far makes it clear what I'm getting wrong.

pool                           ONLINE       0     0     0
  raidz2-0                       ONLINE       0     0     0
    diskid/DISK-PK2331PAG6ZLMT   ONLINE       0     0     0 
    da21                         ONLINE       0     0     0 
    diskid/DISK-PK2331PAG6ZVMT   ONLINE       0     0     0 
    diskid/DISK-PK2331PAG728ET   ONLINE       0     0     0 
    diskid/DISK-PK2331PAG6YGXT   ONLINE       0     0     0 

I want to grab the device or serial number in the first group, and its status (ONLINE|AVAIL) in the second group. The regex I'm using is:

^\s+(da\d+|(?<=diskid/DISK-)\S+)\s+(ONLINE|AVAIL)\s

It's matching the device name da21 and its status, but it's not seeing the devices named by serial number. What am I missing about this syntax?


Solution

  • Why it's not working

    Let's look at a single line to see what your regex is matching:

    # your regex
    ^\s+(da\d+|(?<=diskid/DISK-)\S+)\s+(ONLINE|AVAIL)\s
    
    # your string
        diskid/DISK-PK2331PAG6ZLMT   ONLINE       0     0     0
    <                     # ^ assert position at start of string
    ^^^^                  # \s+ match one or more whitespace characters
        ^!                # da\d+ matches d, then a fails against i; backtrack and try the next alternative
    <<<<<!                # (?<=diskid/DISK-) assert what precedes matches the lookbehind
    # This fails because the text to the left of the regex engine's current position does
    #     not match diskid/DISK- (it's the four spaces that \s+ has just matched)
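
The trace above can be reproduced with Python's built-in re module. This is a minimal sketch: the variable names are illustrative, and the line is copied from the sample output in the question. It shows that the original pattern finds nothing on a serial-number line, that the text to the left of the position where \s+ stops is only whitespace, and that the lookbehind does succeed once nothing has already consumed the prefix.

```python
import re

line = "    diskid/DISK-PK2331PAG6ZLMT   ONLINE       0     0     0"

# The pattern from the question, unchanged.
original = re.compile(r"^\s+(da\d+|(?<=diskid/DISK-)\S+)\s+(ONLINE|AVAIL)\s")
no_match = original.search(line)

# Position where \s+ stops: the index of the first non-space character.
pos = len(line) - len(line.lstrip())

# Text directly to the left of that position, up to the lookbehind's length.
prefix = "diskid/DISK-"
preceding = line[max(0, pos - len(prefix)):pos]

# On its own, the lookbehind works: the engine can reach the position
# right after "diskid/DISK-" and look back at it.
lookbehind_alone = re.search(r"(?<=diskid/DISK-)\S+", line)

print(no_match)                  # None
print(repr(preceding))           # '    '
print(lookbehind_alone.group())  # PK2331PAG6ZLMT
```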
    

    How to fix it?

    There are multiple regex patterns that may satisfy what you're trying to accomplish:

    Option 1: Single capture group

    This captures either \S+ (only when it's preceded by diskid/DISK-) or da\d+ into capture group 1, then captures ONLINE or AVAIL into capture group 2.

    ((?<=diskid/DISK-)\S+|da\d+)\s+(ONLINE|AVAIL)\b
    

    Pro: One capture group
    Con: It can't ensure that the first capture group is at the start of the line
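
As a quick check of Option 1, here is a sketch using re.findall on the sample output from the question. The name ZPOOL_OUTPUT and the "spare da7 AVAIL" string are made up for illustration; the latter is only there to show the con listed above.

```python
import re

# Sample output from the question (trailing spaces stripped).
ZPOOL_OUTPUT = """\
pool                           ONLINE       0     0     0
  raidz2-0                       ONLINE       0     0     0
    diskid/DISK-PK2331PAG6ZLMT   ONLINE       0     0     0
    da21                         ONLINE       0     0     0
    diskid/DISK-PK2331PAG6ZVMT   ONLINE       0     0     0
    diskid/DISK-PK2331PAG728ET   ONLINE       0     0     0
    diskid/DISK-PK2331PAG6YGXT   ONLINE       0     0     0
"""

OPTION_1 = re.compile(r"((?<=diskid/DISK-)\S+|da\d+)\s+(ONLINE|AVAIL)\b")

# findall returns one (device_or_serial, status) tuple per match.
devices = OPTION_1.findall(ZPOOL_OUTPUT)
for device, status in devices:
    print(device, status)

# The con in practice: nothing anchors the match to the start of a line,
# so a device name in the middle of a (hypothetical) line also matches.
mid_line = OPTION_1.search("spare da7 AVAIL")
print(mid_line.groups())  # ('da7', 'AVAIL')
```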

    Option 2: Anchored to the start of the line

    This captures \S+ into capture group 1 if it's preceded by diskid/DISK-, or da\d+ into capture group 2, then captures ONLINE or AVAIL into capture group 3.

    ^\s+(?:diskid/DISK-(\S+)|(da\d+))\s+(ONLINE|AVAIL)\b
    

    Pro: Anchored to the start of the line, so we can ensure that's where the data we're trying to match sits (^\s+)
    Con: Two capture groups (we can't match two different sets of data, each with its own required prefix, into one capture group)
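
In Python, the two device groups of Option 2 can be folded back into one value after the match, since only one of them participates in any given match. A minimal sketch follows, assuming the whole zpool output is held in one string (hence re.MULTILINE, so that ^ matches at the start of every line); ZPOOL_OUTPUT is an illustrative name.

```python
import re

# Sample output from the question (trailing spaces stripped).
ZPOOL_OUTPUT = """\
pool                           ONLINE       0     0     0
  raidz2-0                       ONLINE       0     0     0
    diskid/DISK-PK2331PAG6ZLMT   ONLINE       0     0     0
    da21                         ONLINE       0     0     0
    diskid/DISK-PK2331PAG6ZVMT   ONLINE       0     0     0
    diskid/DISK-PK2331PAG728ET   ONLINE       0     0     0
    diskid/DISK-PK2331PAG6YGXT   ONLINE       0     0     0
"""

OPTION_2 = re.compile(
    r"^\s+(?:diskid/DISK-(\S+)|(da\d+))\s+(ONLINE|AVAIL)\b",
    re.MULTILINE,  # let ^ match at the start of each line, not only the string
)

# Group 1 holds a serial number, group 2 a device name; the group that
# did not take part in the match is None, so "or" picks the one that did.
devices = [
    (m.group(1) or m.group(2), m.group(3))
    for m in OPTION_2.finditer(ZPOOL_OUTPUT)
]
for device, status in devices:
    print(device, status)
```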

    Option 3: Use the regex library

    The third-party regex library from PyPI makes this straightforward: it yields a single device group while still anchoring the match to the start of the line.

    Branch reset method (the alternation yields a single capture group instead of two):

    ^\s+(?|diskid/DISK-(\S+)|(da\d+))\s+(ONLINE|AVAIL)\b
          ^           # same as option 2, but uses branch reset
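
A sketch of how the branch reset pattern might be used from Python. It assumes the third-party regex package is installed (pip install regex); the built-in re module does not support (?|...). SAMPLE and parse_devices are illustrative names, and the import is guarded only so the snippet still loads where the package is missing.

```python
try:
    import regex  # third-party package from PyPI, not the stdlib re module
except ImportError:
    regex = None  # branch reset is unavailable; Option 2 works with plain re

# Two lines taken from the sample output in the question.
SAMPLE = """\
    diskid/DISK-PK2331PAG6ZLMT   ONLINE       0     0     0
    da21                         ONLINE       0     0     0
"""


def parse_devices(text):
    """Return (device_or_serial, status) pairs using a branch-reset group."""
    pattern = regex.compile(
        r"^\s+(?|diskid/DISK-(\S+)|(da\d+))\s+(ONLINE|AVAIL)\b",
        regex.MULTILINE,  # let ^ match at the start of each line
    )
    # Both alternatives write to group 1, so each result has two fields.
    return pattern.findall(text)


if regex is not None:
    for device, status in parse_devices(SAMPLE):
        print(device, status)
```

Compared with Option 2, the calling code no longer has to pick between two device groups, at the cost of a dependency outside the standard library.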