Tags: python, regex, regex-lookarounds

Regular expression negative lookbehind


I want a regular expression that matches addresses ending in four or more digits, unless those digits come right after 'APT' or 'BOX', with or without a trailing space. It should match the following cases:

HITME 1234
HITME 12345
HITME1234

but not the following cases:

BOX 1234
BOX 12345
BOX4044
APT 1234
APT 12345
NONHIT123
NONHIT 123

I came up with this pattern:

(?<!(APT |BOX ))([0-9]{4,})$

but it does not work correctly: it still matches some of the cases it should reject.


Solution

  • TL;DR: use ^(?!APT|BOX).*?([0-9]{4,})$


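    Your regex (?<!(APT |BOX ))([0-9]{4,})$ incorrectly matches the cases below. A lookbehind only checks the characters immediately before the position where the match starts:

    • BOX 12345 on 2345, because 2345 is not immediately preceded by 'APT ' or 'BOX '. It is preceded by 'BOX 1'.
    • BOX4044 on 4044, because 4044 is preceded by 'BOX' with no trailing space, which is neither 'APT ' nor 'BOX '.
    • APT 12345 on 2345, for the same reason as BOX 12345.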
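
The behaviour described above can be reproduced with a short Python sketch. It assumes the standard `re` module and `re.search`; the helper name `matched_digits` is made up for illustration and is not part of the question.

```python
import re

# The pattern from the question: a negative lookbehind in front of the digits.
original = re.compile(r"(?<!(APT |BOX ))([0-9]{4,})$")


def matched_digits(text):
    """Return the text matched by the original pattern, or None if no match."""
    m = original.search(text)
    # The whole match (group 0) is just the digits, since a lookbehind
    # consumes no characters.
    return m.group(0) if m else None


for text in ["BOX 1234", "BOX 12345", "BOX4044", "APT 12345"]:
    print(f"{text!r}: {matched_digits(text)}")
```

Only 'BOX 1234' is rejected here. In the other strings the engine finds a later starting position, or a prefix without a space, that the lookbehind does not forbid.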

    The regex you're looking for is ^(?!APT|BOX).*?([0-9]{4,})$, which breaks down as follows:

    • ^(?!APT|BOX) - the string must not begin with APT or BOX
    • .*? - any characters in the middle of the string, taking as few as possible (e.g. 'HITME ' in your test cases)
    • ([0-9]{4,})$ - the matched digits at the end of the string
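
As a quick check, the suggested pattern can be run against the test cases from the question. This is a minimal sketch that assumes Python's `re` module with one address per string; the helper name `trailing_digits` is hypothetical.

```python
import re

# The suggested pattern: reject strings that begin with APT or BOX, then
# capture four or more digits at the end of the string.
pattern = re.compile(r"^(?!APT|BOX).*?([0-9]{4,})$")


def trailing_digits(address):
    """Return the captured trailing digits, or None if the address is rejected."""
    m = pattern.search(address)
    return m.group(1) if m else None


should_match = ["HITME 1234", "HITME 12345", "HITME1234"]
should_not_match = [
    "BOX 1234", "BOX 12345", "BOX4044",
    "APT 1234", "APT 12345",
    "NONHIT123", "NONHIT 123",
]

for address in should_match + should_not_match:
    print(f"{address!r}: {trailing_digits(address)}")
```

Note that the lookahead is anchored at the start of the string, so it only rejects addresses that begin with APT or BOX. A string such as 'UNIT 5 BOX 1234', which is not among the question's test cases, would still match on 1234.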