Regex match multiline entries containing specified string


I'm trying to collect statements that describe Rectangle 3 using regex (PCRE engine). This is part of a scraping project for a proprietary TGML-ish language.

The input looks like this:

<Rectangle is 
    good>99$1</Rectangle>
<Rectangle is 
    bad>99$2</Rectangle>
<Rectangle is 
    ugly>3$3</Rectangle>
<Rectangle is 
    fat>99$4</Rectangle>
<Rectangle is 
    janky6789>99$5</Rectangle>
<Rectangle is 
    34+35>99$6</Rectangle>
<Rectangle is 
    <>>98$7</Rectangle>
<Rectangle is 
    chicken>3$8</Rectangle>
<Rectangle 1 is 
    holy>97$9</Rectangle>

And I want the output to look like this:

<Rectangle is 
    ugly>3$3</Rectangle>
<Rectangle is 
    chicken>3$8</Rectangle>

I can get matches that contain Rectangle 3, but they also contain everything before it.

<Rectangle\X*?3\$\X*?<\/Rectangle>

It seems like there should be some kind of grouping or backtracking or recursion trick to this, but I can't figure it out.


Solution

  • You can use a regex with negated character classes instead of lazily matching arbitrary graphemes with \X*?:

    <Rectangle[^>]*>3\$[^<]*<\/Rectangle>
    

    See the regex demo.

    Note that your \X*? matches any grapheme, including <, >, and line breaks, so it expands as far as necessary to make the subsequent patterns match, even across several entries. With [^>]* and [^<]*, you restrict the chars that the pattern can match between the fixed subpatterns.

    Details

    • <Rectangle - a literal string
    • [^>]* - any zero or more chars other than >
    • >3\$ - a >3$ string
    • [^<]* - any zero or more chars other than <
    • <\/Rectangle> - a </Rectangle> string.
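To see the difference on the sample input from the question, here is a minimal sketch in Python. It makes two assumptions: Python's built-in re module is used instead of PCRE, and since re has no \X, the lazy pattern uses [\s\S]*? as a stand-in (it behaves the same way on this ASCII input). The names SAMPLE, LAZY and NEGATED are illustrative only.

```python
import re

# Sample input rebuilt from the question, with explicit "\n" so the
# line break inside each opening tag is visible.
SAMPLE = "\n".join([
    "<Rectangle is \n    good>99$1</Rectangle>",
    "<Rectangle is \n    bad>99$2</Rectangle>",
    "<Rectangle is \n    ugly>3$3</Rectangle>",
    "<Rectangle is \n    fat>99$4</Rectangle>",
    "<Rectangle is \n    janky6789>99$5</Rectangle>",
    "<Rectangle is \n    34+35>99$6</Rectangle>",
    "<Rectangle is \n    <>>98$7</Rectangle>",
    "<Rectangle is \n    chicken>3$8</Rectangle>",
    "<Rectangle 1 is \n    holy>97$9</Rectangle>",
])

# Assumption: [\s\S]*? stands in for the question's \X*?, which Python's
# re module does not support. It can cross ">" and "<", so a match that
# starts at an earlier entry runs forward until it finds "3$".
LAZY = re.compile(r"<Rectangle[\s\S]*?3\$[\s\S]*?</Rectangle>")

# The answer's pattern. "/" needs no escaping in Python, so "\/" is
# written as "/". [^>]* cannot leave the opening tag and [^<]* cannot
# run past the start of the closing tag.
NEGATED = re.compile(r"<Rectangle[^>]*>3\$[^<]*</Rectangle>")

lazy_matches = LAZY.findall(SAMPLE)
negated_matches = NEGATED.findall(SAMPLE)

# Print only the entries whose content starts with "3$".
for match in negated_matches:
    print(match)
```

With this input, the first lazy match begins at the "good" entry and only ends at the "ugly" one, which is the over-matching described in the question. The negated-class pattern returns just the "ugly" and "chicken" entries.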