Tags: python, regex, python-re

Allow at most one special symbol within a matched range in a long text


The problem, simplified:

There is an article (a long text).

Extract the content between start and end, both included.

Requirement: there cannot be more than one \n between start and end.

Find all matches.

Use only Python's re module.

The code I run:

import re

# `pattern` is the regex under test; `text` is the article.
lines = re.findall(pattern, text, re.DOTALL)
for line in lines:
    print(line)
    print('===')

So, how can I fix my pattern?

Patterns I have tried:

  1. start[^\n]*\n?[^\n]*end, with this text:
...
start just me and python regex 1 end
start just me and python regex 2 end
start just me and python regex 3 end
...

Wrong output:

start just me and python regex 1 end
start just me and python regex 2 end --> should be a separate match from the line above
===
start just me and python regex 3 end
===
  2. start(?:(?!\n\n).)*?end and start(?:[^\n]|\n(?!\n))*?end, with this text:
start just 
me and python 
regex 1 end
start just me and python regex 2 end
start just me and python regex 3 end

Wrong output:

start just 
me and python 
regex 1 end --> should not match, because the span contains two `\n`
===
start just me and python regex 2 end
===
start just me and python regex 3 end
===

Solution

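  • You can use: start[^\n]*?\n?[^\n]*?end
    The lazy quantifiers (*?) make each match stop at the first end, so consecutive single-line spans are no longer merged, and the single optional \n allows at most one newline inside a match.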
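
As a quick check, here is a minimal sketch that runs this pattern against the two sample texts from the question. The names PATTERN, find_spans, single_line and multi_line are made up for illustration, and the sample texts are reconstructed from the question without the surrounding "..." lines.

```python
import re

# Pattern from the answer: lazy runs of non-newline characters around
# a single optional "\n".
PATTERN = re.compile(r"start[^\n]*?\n?[^\n]*?end")


def find_spans(text):
    """Return every start...end span that contains at most one newline."""
    return PATTERN.findall(text)


# First sample text from the question: three single-line spans.
single_line = (
    "start just me and python regex 1 end\n"
    "start just me and python regex 2 end\n"
    "start just me and python regex 3 end\n"
)

# Second sample text: the first span has two newlines and must be skipped.
multi_line = (
    "start just \n"
    "me and python \n"
    "regex 1 end\n"
    "start just me and python regex 2 end\n"
    "start just me and python regex 3 end\n"
)

for sample in (single_line, multi_line):
    for match in find_spans(sample):
        print(match)
        print("===")
```

re.DOTALL is not needed here because the pattern contains no dot. Note that the pattern does not guard against a second start appearing inside a span; the question does not cover that case.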