
python regex to match multiple words in a line without going to the next line


I'm writing a parser to parse the below output:

    admin@str-s6000-on-5:~$ show interface status Ethernet4
          Interface        Lanes    Speed    MTU         Alias    Vlan    Oper    Admin            Type    Asym PFC
    ---------------  -----------  -------  -----  ------------  ------  ------  -------  --------------  ----------
          Ethernet4  29,30,31,32      40G   9100  fortyGigE0/4   trunk      up       up  QSFP+ or later         off
    PortChannel0001          N/A      40G   9100           N/A  routed      up       up             N/A         N/A
    PortChannel0002          N/A      40G   9100           N/A  routed      up       up             N/A         N/A
    PortChannel0003          N/A      40G   9100           N/A  routed      up       up             N/A         N/A
    PortChannel0004          N/A      40G   9100           N/A  routed      up       up             N/A         N/A

I have made an attempt to write a regex to match all the fields, as below:

    (\S+)\s+([\d,]+)\s+(\S+)\s+(\d+)\s+(\S+)\s+(\S+)\s+([up|down])+\s+([up|down]+)\s+([\w\s+?]+)\s+(\S+)

I'm able to match up to the Admin column correctly. The Type column contains multiple words, so I used the pattern ([\w\s+?]+), hoping it would match multiple words separated by single spaces, with + being optional, followed by (\S+) to match the last column. The problem is that ([\w\s+?]+) spans multiple lines and gives me output like this:

    Ethernet4 29,30,31,32 40G 9100 fortyGigE0/4 trunk up up QSFP+ or later off PortChannel0001 N/A

I see that \s matches the newline as well. How can I restrict it so that it does not match the newline? Could someone please help clarify?

I looked at the question "Regex for one or more words separated by spaces", but that did not help either. Can someone help me understand this better?


Solution

  • Suppose, for simplicity, the data were as follows.

    str = """
    admin@str-s6000-on-5:~$ show interface status Ethernet4
          Interface     Lanes   MTU  Alias    Ad         Type    Asym PFC
    ---------------  --------  ---- ------  ----  -----------  ----------
          Ethernet4  29,30,31  9100  fG0/4    up  Q+ or later         off
    PortChannel0001       N/A  9100    N/A    up          N/A         N/A
    """
    

    I would suggest you use both verbose (a.k.a. free spacing) mode (re.VERBOSE) and named capture groups to make the regular expression self-documenting:

    import re
    
    rgx = r"""
        ^                        # match beginning of line
        [ ]*                     # match zero or more spaces  
        (?P<Interface>\S+)       # match one or more non-whitespaces and
                                 # save to capture group 'Interface'
        [ ]+
        (?P<Lanes>\d+(?:,\d+)*)  # match one or more strings of one or
                                 # more digits separated by commas and
                                 # save to capture group 'Lanes'
        [ ]+
        (?P<MTU>\d+)             # match one or more digits and save to
                                 # capture group 'MTU'
        [ ]+
        (?P<Alias>\S+)           # match one or more non-whitespaces and
                                 # save to capture group 'Alias'
        [ ]+
        (?P<Ad>up|down)          # match 'up' or 'down' and save to
                                 # capture group 'Ad'
        [ ]+
        (?P<Type>\S+(?:[ ]\S+)*) # match one or more groups of
                                 # non-whitespaces separated by one
                                 # space and save to capture group 'Type'
        [ ]*
        (?P<Asym_PFC>off|on)     # match 'off' or 'on' and save to
                                 # capture group 'Asym_PFC'
        [ ]*
        $                        # match end of line
        """
    

    Note that I have assumed the whitespace in the text consists only of spaces (and not tabs, for example), in which case it is preferable to match spaces rather than \s in the expression; this also keeps a match from running onto the next line, because \s matches newlines. Also, I've written each space inside a character class ([ ]); in verbose mode, unescaped whitespace outside a character class is ignored, so a bare space would be stripped out along with the spaces used only for layout. Another way to protect a space is to escape it (\ ).
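
    As a small illustration of the two points above, the sketch below compares a character class built on \s with one built on a literal space, and shows how re.VERBOSE treats a bare, a bracketed, and an escaped space. The sample string and the variable names are made up for this example.

```python
import re

# Hypothetical two-line sample: the end of one row, then the start of the next.
sample = "up  QSFP+ or later\nPortChannel0001"

# \s includes '\n', so a class built on it runs past the end of the line.
crosses_lines = re.search(r"[\w\s+]+", sample).group()

# A literal space in the class keeps the match on a single line.
stays_on_line = re.search(r"[\w +]+", sample).group()

# Under re.VERBOSE a bare space in the pattern is ignored...
bare = re.search(r"a b", "a b", re.VERBOSE)
# ...but a space inside a character class, or an escaped space, is kept.
in_class = re.search(r"a[ ]b", "a b", re.VERBOSE)
escaped = re.search(r"a\ b", "a b", re.VERBOSE)

print(repr(crosses_lines))
print(repr(stays_on_line))
print(bare, bool(in_class), bool(escaped))
```

    If the input might contain tabs as well, a class such as [ \t] would be one way to allow them while still excluding newlines.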

    You may then extract the contents of the capture groups as follows.

    match = re.search(rgx, str, re.VERBOSE | re.MULTILINE)
    
    if match:
        print("capture group 'Interface': ", match.group('Interface'))
        print("capture group 'Lanes':     ", match.group('Lanes'))
        print("capture group 'MTU':       ", match.group('MTU'))
        print("capture group 'Alias':     ", match.group('Alias'))
        print("capture group 'Ad':        ", match.group('Ad'))
        print("capture group 'Type':      ", match.group('Type'))
        print("capture group 'Asym_PFC':  ", match.group('Asym_PFC'))
    else:
        print('did not find')
    

    which displays

    capture group 'Interface':  Ethernet4
    capture group 'Lanes':      29,30,31
    capture group 'MTU':        9100
    capture group 'Alias':      fG0/4
    capture group 'Ad':         up
    capture group 'Type':       Q+ or later
    capture group 'Asym_PFC':   off
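
    re.search stops at the first matching line. To collect every data row, one option is re.finditer together with groupdict(). The sketch below is a variation on the expression above, not the same expression: it assumes that 'N/A' should also be accepted in the Lanes and Asym PFC columns so that the PortChannel row matches, and the names data, row and rows are chosen only for this example.

```python
import re

data = """
admin@str-s6000-on-5:~$ show interface status Ethernet4
      Interface     Lanes   MTU  Alias    Ad         Type    Asym PFC
---------------  --------  ---- ------  ----  -----------  ----------
      Ethernet4  29,30,31  9100  fG0/4    up  Q+ or later         off
PortChannel0001       N/A  9100    N/A    up          N/A         N/A
"""

# Assumption: 'N/A' is a legal value for Lanes and Asym_PFC.
row = re.compile(r"""
    ^[ ]*
    (?P<Interface>\S+)            [ ]+
    (?P<Lanes>\d+(?:,\d+)*|N/A)   [ ]+
    (?P<MTU>\d+)                  [ ]+
    (?P<Alias>\S+)                [ ]+
    (?P<Ad>up|down)               [ ]+
    (?P<Type>\S+(?:[ ]\S+)*)      [ ]+
    (?P<Asym_PFC>off|on|N/A)
    [ ]*$
    """, re.VERBOSE | re.MULTILINE)

# One dict per matching line, keyed by capture-group name.
rows = [m.groupdict() for m in row.finditer(data)]

for r in rows:
    print(r["Interface"], "|", r["Type"], "|", r["Asym_PFC"])
```

    The prompt, header and separator lines are skipped because their second field is neither a list of digits nor 'N/A'.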
    


    Note that it may be sufficient to determine whether each line is a keeper and, if it is, simply split the line on the regular expression r' {2,}'; that is, split on two or more spaces (after stripping spaces from the beginning and end of the line). If, for example, lines of interest begin with 'Ethernet', possibly padded on the left with spaces, we could use the regular expression r'^ *Ethernet.*' to identify lines of interest and r' {2,}' to split those lines and assign the pieces to variables.
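
    A minimal sketch of that split-based approach follows, using the simplified data from earlier. The column names in columns and the variable names are assumptions made for this example. It also relies on adjacent columns always being separated by at least two spaces, which holds for this sample but may not hold for every table.

```python
import re

data = """
admin@str-s6000-on-5:~$ show interface status Ethernet4
      Interface     Lanes   MTU  Alias    Ad         Type    Asym PFC
---------------  --------  ---- ------  ----  -----------  ----------
      Ethernet4  29,30,31  9100  fG0/4    up  Q+ or later         off
PortChannel0001       N/A  9100    N/A    up          N/A         N/A
"""

# Hypothetical column names, in the order the fields appear.
columns = ["Interface", "Lanes", "MTU", "Alias", "Ad", "Type", "Asym_PFC"]

# Lines of interest: those that begin with 'Ethernet' after optional spaces.
keeper = re.compile(r"^ *Ethernet.*", re.MULTILINE)

records = []
for line in keeper.findall(data):
    # Split on runs of two or more spaces; single spaces inside a field survive.
    fields = re.split(r" {2,}", line.strip())
    records.append(dict(zip(columns, fields)))

for rec in records:
    print(rec)
```

    The prompt line also contains 'Ethernet4', but it does not start with it, so it is not treated as a keeper.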
