Tags: python, regex, regex-negation, regex-replace

Regular expression with negative lookahead and negative lookbehind to check if my match IS NOT between [[ and ]]


I'm trying to write a Python script that replaces each occurrence of the given keywords in a given Markdown file with the same keyword wrapped in [[ and ]].

It will be run several times on the same files, so I don't want to end up with, for instance, FOO becoming [[FOO]], then [[[[FOO]]]], etc.

So I don't want to match FOO when it is already enclosed by [[ and ]].

The closest version I came up with is this: (?<!\[\[)\b(FOO)\b(?!\]\])

The status of my test list is:

Should     match : lorem ipsum FOO dolor              ==> OK
Should NOT match : lorem ipsum [[FOO]]  dolor         ==> OK
Should NOT match : lorem [[ipsum FOO dolor]] sit amet ==> Not OK
Should NOT match : lorem [[ipsumFOOsolor]] sit amet   ==> OK
Should NOT match : [[lorem]]  [[ipsum-FOO&dolor-sit.pdf#page=130]] ==> Not OK
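
The statuses above can be reproduced with a short check. This is a minimal sketch that assumes the test strings are exactly the five lines listed; the names `cases` and `statuses` are introduced here for illustration only.

```python
import re

# The pattern from the question: each lookaround only inspects the two
# characters immediately next to the keyword.
pattern = re.compile(r'(?<!\[\[)\b(FOO)\b(?!\]\])')

# (text, whether FOO should match), copied from the test list above.
cases = [
    ("lorem ipsum FOO dolor", True),
    ("lorem ipsum [[FOO]]  dolor", False),
    ("lorem [[ipsum FOO dolor]] sit amet", False),
    ("lorem [[ipsumFOOsolor]] sit amet", False),
    ("[[lorem]]  [[ipsum-FOO&dolor-sit.pdf#page=130]]", False),
]

statuses = []
for text, should_match in cases:
    matched = pattern.search(text) is not None
    statuses.append("OK" if matched == should_match else "Not OK")
    print(f"{statuses[-1]:6} : {text}")
```

The two failing cases are the ones where FOO is inside a [[...]] span but not directly adjacent to the brackets, which is the part the lookarounds cannot see.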

For reference, I would like to use this regexp in this Python snippet:

    for term in term_list:
        pattern = r'(?<!\[\[)\b(' + re.escape(term) + r')\b(?!\]\])'
        file_content = re.sub(pattern, r'[[\1]]', file_content)

What could be the regexp I need? What is wrong with this approach?

Thanks!


Solution

  • What you might do, not taking nested [[..[[..]]..]] into account, is to get the [[...]] part out of the way and capture what you want to keep in a group.

    Then use that group in the replacement, and leave the part that is only matched (not in the group) untouched.

    This part of the pattern, (?:(?!\[\[|]]).)*, matches any character, one at a time, as long as neither [[ nor ]] starts at that position.

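    import re

    # First alternative: match a whole [[...]] span without capturing it.
    # Second alternative: capture a standalone FOO in group 1.
    pattern = r"\[\[(?:(?!\[\[|]]).)*\]\]|\b(FOO)\b"

    s = ("lorem ipsum FOO dolor\n"
         "Should NOT match : lorem ipsum [[FOO]]  dolor\n"
         "Should NOT match : lorem [[ipsum FOO dolor]] sit amet\n"
         "Should NOT match : lorem [[ipsumFOOsolor]] sit amet\n"
         "Should NOT match : [[lorem]]  [[ipsum-FOO&dolor-sit.pdf#page=130]]")

    # Wrap the keyword only when group 1 took part in the match;
    # otherwise put the matched [[...]] span back unchanged.
    result = re.sub(pattern, lambda x: f"[[{x.group(1)}]]" if x.group(1) else x.group(), s)
    print(result)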
    

    Output

    lorem ipsum [[FOO]] dolor
    Should NOT match : lorem ipsum [[FOO]]  dolor
    Should NOT match : lorem [[ipsum FOO dolor]] sit amet
    Should NOT match : lorem [[ipsumFOOsolor]] sit amet
    Should NOT match : [[lorem]]  [[ipsum-FOO&dolor-sit.pdf#page=130]]
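
The same idea might be adapted to the term list from the question roughly as follows. This is only a sketch: the function name `link_terms` and the sample text are hypothetical, and it assumes every term starts and ends with a word character (otherwise `\b` will not behave as expected) and that no [[...]] span crosses a line break, since `.` does not match a newline by default.

```python
import re

def link_terms(text, terms):
    """Wrap each term in [[...]] unless it is already inside a [[...]] span."""
    # Try longer terms first so a short term cannot pre-empt a longer one.
    escaped = "|".join(re.escape(t) for t in sorted(terms, key=len, reverse=True))
    # Alternative 1 consumes existing [[...]] spans; alternative 2 captures a term.
    pattern = re.compile(r"\[\[(?:(?!\[\[|]]).)*\]\]|\b(" + escaped + r")\b")
    return pattern.sub(
        lambda m: f"[[{m.group(1)}]]" if m.group(1) else m.group(), text
    )

once = link_terms("FOO and BAR, but not [[see FOO here]]", ["FOO", "BAR"])
twice = link_terms(once, ["FOO", "BAR"])
print(once)           # [[FOO]] and [[BAR]], but not [[see FOO here]]
print(twice == once)  # True: a second run changes nothing
```

Because every existing [[...]] span is consumed by the first alternative before a term can match inside it, running the function again on its own output should leave the text unchanged, which is the repeated-run behaviour asked for.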