Regex named group that may not exist


Good morning,

I have a string that I need to parse, printing the contents of two named groups, one of which might not exist.

The string looks like this (basically the content of /proc/pid/cmdline):

"""
<some chars with letters / numbers / space / punctuation> /CLASS_NAME:myapp.server.starter.StarterHome /PARAM_XX:value_XX /PARAM_XX:value_XX /CONFIG_FILE:myapp.server.config.myconfig.txt /PARAM_XX:value_XX /PARAM_XX:value_XX /PARAM_XX:value_XX <some chars with letters / numbers / space / punctuation>
"""

My processes all follow almost the same pattern, that is:

/CLASS_NAME:myapp.server.starter.StarterHome is always present, but /CONFIG_FILE:myapp.server.config.myconfig.txt is NOT always present.

I'm using Python 2 with the re module to capture the values. So far my pattern looks like this, and I'm able to capture the value I want for /CLASS_NAME:

re.compile(r'CLASS_NAME:\w+\W\w+\W\w+\W(?P<class>\w+)')

Then, because /CONFIG_FILE may or may not be present, I added the following to my regexp:

re.compile(r"""CLASS_NAME:\w+\W\w+\W\w+\W(?P<class>\w+).*?
               (CONFIG_FILE:\w+\W\w+\W\w+\W(?P<cnf>\w+.txt))?
            """, re.X)

My understanding is that the second part of my regexp is optional because the whole part is enclosed in parentheses followed by ?.

Unfortunately my assumption is wrong, as the pattern never captures the cnf group.

I also tried removing the first ?, but it didn't help.

I made several attempts in Pythex to understand my regexp but couldn't find a solution.

Does anyone have a suggestion to resolve my case?


Solution

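  • In your pattern, the lazy .*? is satisfied with zero characters and the optional group that follows then matches the empty string, so cnf never takes part in the match. You can wrap the whole optional part, including .*?, within an optional non-capturing group and make the capturing group for CONFIG_FILE obligatory: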

    re.compile(r"""CLASS_NAME:(?:\w+\W+){3}(?P<class>\w+)(?:.*?
                   (CONFIG_FILE:(?:\w+\W+){3}(?P<cnf>\w+\.txt)))?
            """, re.X)
    

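    If the input may contain newlines, pass re.X | re.S as the flags so that . also matches line breaks. Note that \w+\W\w+\W\w+\W can be written more compactly as (?:\w+\W+){3}; the \W+ additionally allows more than one separator character between the words.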

    The main difference is the (?:.*?(CONFIG_FILE:(?:\w+\W+){3}(?P<cnf>\w+\.txt)))? part:

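    • (?: - start of a non-capturing group that is optional (because of the greedy ? quantifier after its closing parenthesis) and matches:
      • .*? - any 0+ chars other than line breaks (any chars at all with re.S), as few as possible
      • (CONFIG_FILE:(?:\w+\W+){3}(?P<cnf>\w+\.txt)) - a capturing group that matches: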
        • CONFIG_FILE: - a literal substring
        • (?:\w+\W+){3} - three sequences of 1+ word chars followed with 1+ non-word chars
        • (?P<cnf>\w+\.txt) - Group cnf: 1+ word chars, a dot (note it should be escaped) and then txt
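    • )? - end of the optional non-capturing group (it matches one or zero times; because ? is greedy, the engine tries to match the group first and skips it only if that fails)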
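
To see the difference in practice, here is a minimal sketch that runs both patterns against two sample command lines. The helper name parse_cmdline and the shortened sample strings are made up for illustration; only the two regular expressions come from the question and the answer above.

```python
import re

# Pattern from the answer: the lazy .*? sits INSIDE the optional group.
PATTERN = re.compile(r"""CLASS_NAME:(?:\w+\W+){3}(?P<class>\w+)(?:.*?
               (CONFIG_FILE:(?:\w+\W+){3}(?P<cnf>\w+\.txt)))?
            """, re.X | re.S)

# Pattern from the question, for comparison: .*? sits OUTSIDE the optional group.
ORIGINAL = re.compile(r"""CLASS_NAME:\w+\W\w+\W\w+\W(?P<class>\w+).*?
               (CONFIG_FILE:\w+\W\w+\W\w+\W(?P<cnf>\w+.txt))?
            """, re.X)


def parse_cmdline(cmdline, pattern=PATTERN):
    """Hypothetical helper: return (class, cnf), or None if CLASS_NAME is missing.

    cnf is None when the optional group did not take part in the match.
    """
    m = pattern.search(cmdline)
    if m is None:
        return None
    return m.group("class"), m.group("cnf")


# Shortened, made-up sample inputs modelled on the string in the question.
with_cnf = ("java /CLASS_NAME:myapp.server.starter.StarterHome /PARAM_XX:value_XX "
            "/CONFIG_FILE:myapp.server.config.myconfig.txt /PARAM_XX:value_XX")
without_cnf = "java /CLASS_NAME:myapp.server.starter.StarterHome /PARAM_XX:value_XX"

print(parse_cmdline(with_cnf))            # cnf is captured
print(parse_cmdline(without_cnf))         # cnf is None, class is still captured
print(parse_cmdline(with_cnf, ORIGINAL))  # the question's pattern leaves cnf as None
```

Checking the cnf value for None is enough to tell whether /CONFIG_FILE was present, so no second search over the string is needed.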