Avoid extracting a word with a specific term


My plan is to extract a group of words from a string with a regex. However, the word NOT sometimes appears in front of a word that would otherwise be extracted, and I am not sure how to deal with that.

Test string:

tag=os index=linux index=windows NOT index=mac tag=db index="a_something-else" NOT   index=solaris

Current (failing) regex:

index=(\")?(?<my_indexes>\w+(-)?(\w+)?)(\")?

This regex extracts every index=zyx word. But entries preceded by NOT, e.g. NOT index=mac or NOT index=solaris, should be skipped. The results should look like this:

index=linux
index=windows
index="a_something-else"

Any suggestions?


Solution

  • As you mention that it is PCRE, one option is to use a SKIP FAIL pattern, together with a capturing group and a backreference to pair up the matching double quote.

    You can make the double quote optional inside the capturing group and refer back to it with \1 in the first alternative and \2 in the second.

    Note that you don't have to escape the double quote by itself.

    \bNOT\h+index=("?)\w+(?:-\w+)*\1(*SKIP)(*FAIL)|index=("?)\w+(?:-\w+)*\2
    

    Explanation

    • \bNOT\h+ Match NOT and 1+ horizontal whitespace chars
    • index=("?) Match index= and capture an optional " in group 1
    • \w+(?:-\w+)*\1 Match 1+ word chars, optionally followed by repetitions of - and 1+ word chars. Then a backreference to what is captured in group 1
    • (*SKIP)(*FAIL) Discard what has been matched so far and fail, so the search resumes after the NOT index=... part instead of retrying inside it
    • | Or
    • index=("?) Match index= and capture an optional " in group 2
    • \w+(?:-\w+)*\2 The same as the previous pattern above, now with a backreference to group 2
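
For readers outside PCRE: Python's built-in re module supports neither (*SKIP)(*FAIL) nor \h, so the sketch below swaps in a different technique. It keeps the same two alternatives, lets the NOT branch consume the negated entry, and then drops those matches in code. The function name extract_indexes and the group names keep and q are made up for this illustration.

```python
import re

# Test string from the question.
TEXT = ('tag=os index=linux index=windows NOT index=mac tag=db '
        'index="a_something-else" NOT   index=solaris')

# Same two alternatives as the PCRE pattern, minus (*SKIP)(*FAIL):
#   1) NOT + spaces/tabs + index=...  (consumed, but "keep" stays unset)
#   2) index=...                      (captured in "keep")
# [ \t]+ is a rough stand-in for \h+, which re does not support.
PATTERN = re.compile(
    r'\bNOT[ \t]+index=("?)\w+(?:-\w+)*\1'
    r'|(?P<keep>index=(?P<q>"?)\w+(?:-\w+)*(?P=q))'
)

def extract_indexes(text):
    """Return the index=... entries that are not preceded by NOT."""
    return [m.group("keep") for m in PATTERN.finditer(text)
            if m.group("keep") is not None]

result = extract_indexes(TEXT)
print(result)
```

The filter on keep plays the role of (*SKIP)(*FAIL): because the first alternative has already consumed the negated entry, the second alternative never gets a chance to match inside it.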


    If you don't want the double quotes around a_something-else and only want the value after the =, you could use another capturing group, or reuse the named capturing group my_indexes from the question:

    \bNOT\h+index=("?)\w+(?:-\w+)*\1(*SKIP)(*FAIL)|index=("?)(?<my_indexes>\w+(?:-\w+)*)\2
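
Reading the named group from code could look like the following sketch. It assumes Python's re module rather than PCRE, so (*SKIP)(*FAIL) is replaced by a check that my_indexes actually took part in the match, and \h+ is approximated by [ \t]+. The group name q is an addition for this example; (?P<name>...) is Python's spelling of a named group.

```python
import re

# Test string from the question.
text = ('tag=os index=linux index=windows NOT index=mac tag=db '
        'index="a_something-else" NOT   index=solaris')

pattern = re.compile(
    # Negated entry: consumed so it cannot match below, nothing named is set.
    r'\bNOT[ \t]+index=("?)\w+(?:-\w+)*\1'
    # Wanted entry: optional quote in "q", bare value in "my_indexes".
    r'|index=(?P<q>"?)(?P<my_indexes>\w+(?:-\w+)*)(?P=q)'
)

# Keep only the matches where the named group participated.
values = [m.group("my_indexes") for m in pattern.finditer(text)
          if m.group("my_indexes") is not None]
print(values)
```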
    
