Tags: python, regex, python-re

Regex to match a word that stands on its own or is surrounded by underscores


I'm trying to match the word int when it either stands on its own or is surrounded by underscores (_).

int  # match
_int_  # match
__int__  # match
some_int  # match
int_var  # match
integration  # doesn't match
mint  # doesn't match

This is what I've been trying, but it only matches the second case above:

pattern = re.compile(r"(?<=[\W_])int(?=[\W_])")

How should I go about doing this? Thanks everyone


Solution

  • You need double-negation logic in this case:

    (?<![^\W_])int(?![^\W_])
    

    See the regex demo.

    The (?<![^\W_]) lookbehind matches a location that is not immediately preceded by a char other than a non-word or _ char, that is, not preceded by a letter or digit. This means the position immediately on the left must be the start of the string, a non-word char, or an underscore.

    The (?![^\W_]) lookahead matches a location that is not immediately followed by a char other than a non-word or _ char, that is, not followed by a letter or digit. This means the position immediately on the right must be the end of the string, a non-word char, or an underscore.
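As a quick check of the two lookarounds described above, here is a minimal sketch that runs the pattern against the sample strings from the question. The names `samples` and `results` are only illustrative and do not come from the original post.

```python
import re

# Pattern from the answer: "int" with no letter or digit directly on either side.
pattern = re.compile(r"(?<![^\W_])int(?![^\W_])")

# Sample strings taken from the question (each tested on its own).
samples = ["int", "_int_", "__int__", "some_int", "int_var", "integration", "mint"]

# Map each sample to whether the pattern finds a match anywhere in it.
results = {s: bool(pattern.search(s)) for s in samples}

for s, matched in results.items():
    print(f"{s!r}: {'match' if matched else 'no match'}")
```

Only "integration" and "mint" should be reported as non-matching, because in both cases a letter sits directly next to int.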

    In your regex, the (?<=[\W_]) positive lookbehind requires a non-word or _ char immediately on the left, and the (?=[\W_]) positive lookahead requires a non-word or _ char immediately on the right. Because both lookarounds require an actual character to be present, they do not allow matches at the start or end of the string.
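To illustrate the start/end-of-string problem, the following sketch applies the pattern from the question to a few of the sample strings, each tested as a separate string. The variable names are hypothetical.

```python
import re

# Pattern from the question: requires a real character on both sides of "int".
original = re.compile(r"(?<=[\W_])int(?=[\W_])")

cases = ["int", "some_int", "int_var", "_int_", "__int__"]

# True only when a non-word char or underscore exists on BOTH sides.
original_results = {s: bool(original.search(s)) for s in cases}

for s, matched in original_results.items():
    print(f"{s!r}: {'match' if matched else 'no match'}")
```

Here "int", "some_int" and "int_var" fail because at least one side of int touches a string boundary, where there is no character for the positive lookaround to consume.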

    NOTE: As you are using Python re, you cannot simply add a ^| alternative to your lookbehind, because Python re does not allow lookbehinds with non-fixed-width patterns. (?<=[\W_]|^)int(?=[\W_]|$) will work in PHP/PCRE, Java, Ruby/Onigmo, but won't compile in Python re. The anchor would have to be moved outside the lookbehind, as in (?:^|(?<=[\W_])), which is more verbose. That is why the double-negation approach is the simplest option here.
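The fixed-width restriction mentioned in the note can be observed directly. This is a small sketch, assuming the standard-library re module (not the third-party regex package); `compiled` and `error_message` are illustrative names.

```python
import re

compiled = False
error_message = None

try:
    # The lookbehind mixes a one-char branch ([\W_]) with a zero-width branch (^),
    # so its width is not fixed and re refuses to compile it.
    re.compile(r"(?<=[\W_]|^)int(?=[\W_]|$)")
    compiled = True
except re.error as exc:
    error_message = str(exc)

print("compiled:", compiled)
print("error:", error_message)
```

The pattern is rejected at compile time, before any matching is attempted.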