Negative lookahead regex in `re.subn()` context


I am trying to use regular expressions to replace numeric ranges in text, such as "4-5", with the phrase "4 to 5".

The text also contains dates such as "2024-12-26" that should not be replaced (should be left as is).

The regular expression (\d+)(\-)(\d+) (attempt one below) is clearly wrong, because it falsely matches dates.

Using a negative lookahead expression, I came up with the regex (?!\d+\-\d+\-)(\d+)(\-)(\d+) instead (attempt two below), which correctly matches "4-5" while rejecting "2024-12-26".

However, attempt_two does not behave correctly in a re.subn() context, because although it rejects "2024-12-26", the search continues on to match (and replace) the substring "12-26":

import re

text = """
2024-12-26
4-5
78-79
"""

attempt_one = re.compile(r"(\d+)(\-)(\d+)")
attempt_two = re.compile(r"(?!\d+\-\d+\-)(\d+)(\-)(\d+)")

print("Attempt one:")
print(re.match(attempt_one, "4-5"))  # Match: OK
print(re.match(attempt_one, "2024-12-26"))  # Match: False positive
new_text, _ = re.subn(attempt_one, r"\1 to \3", text)  # Incorrect substitution
print(new_text)

print("Attempt two:")
print(re.match(attempt_two, "4-5"))  # Match: OK
print(re.match(attempt_two, "2024-12-26"))  # Doesn't match: OK
new_text, _ = re.subn(attempt_two, r"\1 to \3", text)  # Still incorrect
print(new_text)

Output:

Attempt one:
<re.Match object; span=(0, 3), match='4-5'>
<re.Match object; span=(0, 7), match='2024-12'>

2024 to 12-26
4 to 5
78 to 79

Attempt two:
<re.Match object; span=(0, 3), match='4-5'>
None

2024-12 to 26
4 to 5
78 to 79

What regular expression can I use so that the substitution returns the following instead?

2024-12-26
4 to 5
78 to 79

(As my goal is to learn about regular expressions, I am not interested in workarounds such as matching the whitespace or newline after "12-26".)


Solution

  • You need both a negative lookbehind and a negative lookahead, to prohibit an extra hyphen before or after the match.

    (?<![-\d])(\d+)-(\d+)(?![-\d])
    

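    Each lookaround must also reject digits, not only hyphens. Otherwise the engine can backtrack to a shorter match inside the date: with hyphen-only lookarounds, the pattern matches 2024-1 and then 2-26 in 2024-12-26. Because the hyphen is no longer captured, the second number is now group 2, so the replacement becomes \1 to \2.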
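
As a quick check, here is a minimal sketch that applies the pattern above to the sample text from the question. The names `ranges` and `hyphen_only` are made up for illustration, and `hyphen_only` is a deliberately weaker variant (not part of the answer) included only to show what goes wrong when the lookarounds reject hyphens but not digits.

```python
import re

text = """
2024-12-26
4-5
78-79
"""

# Pattern from the answer: neither a digit nor a hyphen may touch
# either end of the match.
ranges = re.compile(r"(?<![-\d])(\d+)-(\d+)(?![-\d])")

# The hyphen is not captured here, so the second number is group 2.
new_text, count = ranges.subn(r"\1 to \2", text)
print(new_text)
print(count)  # 2 substitutions: "4-5" and "78-79"

# Hypothetical weaker variant for comparison: lookarounds reject only hyphens.
# Backtracking lets it carve pieces out of the date.
hyphen_only = re.compile(r"(?<!-)(\d+)-(\d+)(?!-)")
print(hyphen_only.findall("2024-12-26"))  # [('2024', '1'), ('2', '26')]
print(ranges.findall("2024-12-26"))       # []
```

The date line is left untouched because every candidate start or end position inside it is adjacent to either a digit or a hyphen, so no shorter match can be found by backtracking or by resuming the scan further along.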