
python symmetric start/end regex does not work


I'm using Python 3 and read a string from a file:

with open(filepath, "r", encoding="utf-8") as f:
    content_string = f.read()

It looks like this:

---
section-1-line-1
section-1-line-2
section-1-line-3
---
section-2-line-1
section-2-line-2
section-2-line-3
---
section-3-line-1
section-3-line-2
section-3-line-3
---

I need to remove the entire section that contains the line section-2-line-2.

So the end result should be:

---
section-1-line-1
section-1-line-2
section-1-line-3
---
section-3-line-1
section-3-line-2
section-3-line-3
---

So I created a regex:

rx = re.compile(r'---[^-{3}]+section-2-line-2[^-{3}]+---', re.S)
content_string_modified = re.sub(rx, '', content_string)

The regex above does nothing, i.e. it does not match. If I remove the closing --- from the pattern (r'---[^-{3}]+section-2-line-2[^-{3}]+'), it matches partially: the opening negated class works, but the closing negated class seems to ignore the {3} quantifier and stops at the first dash rather than at the first run of three dashes, so it leaves behind part of the section that should be removed:

---
section-1-line-1
section-1-line-2
section-1-line-3
-2-line-3
---
section-3-line-1
section-3-line-2
section-3-line-3
---

Why? How can I make both the starting and the ending [^-{3}]+ work? Thanks!


Solution

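  • A character class cannot exclude a multi-character string: [^-{3}] matches any single character other than -, {, 3 or }, and the {3} inside the brackets is not a quantifier. To exclude a sequence, use a negative lookahead instead.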

    For example, (?:(?!---).)* matches any run of characters that does not contain three consecutive dashes.
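
To make the difference concrete, here is a minimal sketch comparing the two constructs. The variable names and the short test strings are made up for illustration and are not from the question.

```python
import re

# [^-{3}] is ONE negated character class: it rejects '-', '{', '3' and '}'.
# The {3} inside the brackets is not a quantifier.
char_class = re.compile(r'[^-{3}]+')
print(repr(char_class.match('abc-def').group()))  # stops at the first dash
print(repr(char_class.match('ab3cd').group()))    # also stops at a literal '3'

# Lookahead version: consume one character at a time, but only while
# the text at the current position does not start with '---'.
not_three_dashes = re.compile(r'(?:(?!---).)*', re.S)
print(repr(not_three_dashes.match('a-b--c\n---d').group()))
```

The lookahead version passes over single and double dashes and stops only where three dashes begin, which is the behaviour the question expected from [^-{3}]+.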

    Your full regex will be

    ---(?:(?!---).)*section-2-line-2.*?(?=---)
    

    Note that you don't need a lookahead after your search phrase; a simple lazy quantifier is enough there.

    Demo here.

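    Also note that once the regex is compiled, it is more idiomatic to call its own sub method than to pass it to re.sub (which does accept a compiled pattern, so the original call was not an error):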

    import re

    rx = re.compile(r'---(?:(?!---).)*section-2-line-2.*?(?=---)', re.S)
    content_string_modified = rx.sub('', content_string)
    

    Demo of code here.
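
For reference, a self-contained sketch of the whole round trip. It assumes the file content is exactly the sample shown in the question, inlined here as a string instead of being read from filepath.

```python
import re

# Sample data from the question, inlined so the snippet runs on its own.
content_string = (
    "---\n"
    "section-1-line-1\n"
    "section-1-line-2\n"
    "section-1-line-3\n"
    "---\n"
    "section-2-line-1\n"
    "section-2-line-2\n"
    "section-2-line-3\n"
    "---\n"
    "section-3-line-1\n"
    "section-3-line-2\n"
    "section-3-line-3\n"
    "---\n"
)

# re.S lets '.' match newlines, so the pattern can span several lines.
# The trailing lookahead leaves the next '---' in place as the delimiter
# between the two remaining sections.
rx = re.compile(r'---(?:(?!---).)*section-2-line-2.*?(?=---)', re.S)
content_string_modified = rx.sub('', content_string)
print(content_string_modified)
```

Because the final --- is only checked by the lookahead and not consumed, it stays in the output as the separator between section 1 and section 3.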