regex, regex-lookarounds, pcre

PCRE regex to match next occurrence of specific pattern or EOF


I have a file with the following content:

#### v2

START MATCH

Text explaning things and stuff.
This has to be matched.

END MATCH

#### v1

Do not match this part (or anything
below "END MATCH" part).

#### v0

Do not match this either.

I'm trying to match everything between START MATCH and END MATCH (including newlines). However, the text below END MATCH might not exist; the file may simply end there. Also, there is no literal END MATCH text; it's just a marker to show what I'm trying to achieve.

I tried the following regex pattern, (?<=# v.\n\n)(.|\n)*(?=\n(?:#.*?|$)), which seems fine to me if the file ends with END MATCH. But if there are additional lines below it (starting with a newline and a # character), my pattern captures that part as well.

How can I modify my pattern (probably just the last part (?=\n(?:#.*?|$))?) to exclude everything after END MATCH?

The example can be tested here: https://regex101.com/r/bJAZZq/1


Solution

  • You can use

    #\h+v\d+\R{2}\K(?s:.*?)(?=\R#|\z)
    

    See the regex demo.

    Details

    • # - a # char
    • \h+ - one or more horizontal whitespaces
    • v - a v letter
    • \d+ - one or more digits
    • \R{2} - two line break sequences
    • \K - omit the text matched so far
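    • (?s:.*?) - any zero or more chars, including line break chars (because of the s flag), as few as possible
    • (?=\R#|\z) - a positive lookahead that stops the match at the first position followed by a line break sequence and then #, or by the very end of the string.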
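For engines without `\K`, `\R`, or `\h` (Python's built-in `re` module is one of them), the same idea can be sketched with a capture group. This is a translation under stated assumptions, not the PCRE pattern itself: `[ \t]+` roughly stands in for `\h+`, `(?:\r\n|\r|\n)` for `\R`, `\Z` for `\z`, and group 1 replaces `\K`. The `SAMPLE` text and the `extract_first_section` helper are made up for illustration.

```python
import re

# Python's built-in `re` has no \K, \R or \h, so this is a translation:
#   \h+ -> [ \t]+   \R -> (?:\r\n|\r|\n)   \z -> \Z   \K -> capture group 1
NL = r"(?:\r\n|\r|\n)"
SECTION = re.compile(rf"#[ \t]+v\d+{NL}{NL}((?s:.*?))(?={NL}#|\Z)")


def extract_first_section(text):
    """Return the body under the first "# vN" heading, or None (hypothetical helper)."""
    match = SECTION.search(text)
    return match.group(1) if match else None


# Made-up sample mirroring the file in the question.
SAMPLE = (
    "#### v2\n\n"
    "START MATCH\n\nThis has to be matched.\n\nEND MATCH\n\n"
    "#### v1\n\nDo not match this part.\n"
)

print(repr(extract_first_section(SAMPLE)))
# prints 'START MATCH\n\nThis has to be matched.\n\nEND MATCH\n'
```

The captured body keeps one trailing line break, because the lookahead only accounts for the last line break before the next `#`. Strip it afterwards if it is not wanted.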

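    Please note that (.|\n)* is a very bad regex construct: the engine has to enter the group and try the alternation for every single character, which consumes a lot of computational resources and can lead to performance issues on long inputs. Avoid it and use . with the s (DOTALL) flag instead, as in (?s:.*?).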
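As a small illustration in Python (whose `re` module supports the scoped `(?s:...)` flag since 3.6), the two constructs match the same text, so swapping one for the other does not change the result. The variable names and sample text below are made up, and the sketch makes no timing claims.

```python
import re

text = "line one\nline two\nline three"

# Alternation inside a repeated group: two branches to try at every character.
with_alternation = re.compile(r"(?:.|\n)*")
# Scoped DOTALL flag: the dot itself matches line breaks inside this group.
with_dotall = re.compile(r"(?s:.*)")

same = with_alternation.match(text).group() == with_dotall.match(text).group()
print(same)                              # prints True
print(re.match(r".*", text).group())     # prints line one (plain dot stops at \n)
```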

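    The \R construct is very useful in PCRE and similar regex engines: it matches any kind of line break sequence, that is \r\n, \n, \r and, depending on the options or the exact library implementation, further characters such as vertical tab, form feed, or the Unicode line and paragraph separators.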
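Where `\R` is not available, a rough stand-in can be spelled out by hand. The sketch below assumes the character set of PCRE's default `\R` in Unicode mode and uses Python's `re`; `ANY_NEWLINE` and `split_lines` are hypothetical names. Unlike PCRE's `\R`, this alternation is not atomic, which does not matter for simple splitting.

```python
import re

# Assumed approximation of PCRE's default \R: CRLF first (so it counts as one
# break), then LF, VT, FF, CR, NEL, LINE SEPARATOR, PARAGRAPH SEPARATOR.
ANY_NEWLINE = re.compile(r"\r\n|[\n\x0b\x0c\r\x85\u2028\u2029]")


def split_lines(text):
    """Split text on any of the line break sequences listed above."""
    return ANY_NEWLINE.split(text)


print(split_lines("a\r\nb\nc\rd"))   # prints ['a', 'b', 'c', 'd']
```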

    \K lets you use the + and * quantifiers in the part of the pattern that precedes the text you actually want to grab, unlike lookbehinds, where the pattern length must be fixed.
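The fixed-length restriction is easy to observe in Python's `re` module, which also lacks `\K`. The sketch below checks whether a lookbehind compiles and then shows the usual workaround of matching the variable-length prefix normally and capturing the remainder. `lookbehind_compiles` is a hypothetical helper and the sample strings are made up.

```python
import re


def lookbehind_compiles(pattern):
    """Return True if Python's `re` accepts the pattern, False otherwise."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False


# Fixed-length lookbehind, as in the question: accepted.
print(lookbehind_compiles(r"(?<=# v.\n\n)START"))     # prints True
# Variable-length lookbehind (\d+): rejected by Python's `re`.
print(lookbehind_compiles(r"(?<=# v\d+\n\n)START"))   # prints False

# Workaround without \K: match the prefix, capture what you want to keep.
match = re.search(r"# v\d+\n\n(START)", "#### v10\n\nSTART")
print(match.group(1))                                 # prints START
```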