regex recursion regex-group regex-greedy

Regex match multiline entries containing specified string

I'm trying to collect statements that describe Rectangle 3 using regex (PCRE engine). This is part of a scraping project for a proprietary TGML-ish language.

The input looks like this:

<Rectangle is 
    good>99$1</Rectangle>
<Rectangle is 
    bad>99$2</Rectangle>
<Rectangle is 
    ugly>3$3</Rectangle>
<Rectangle is 
    fat>99$4</Rectangle>
<Rectangle is 
    janky6789>99$5</Rectangle>
<Rectangle is 
    34+35>99$6</Rectangle>
<Rectangle is 
    <>>98$7</Rectangle>
<Rectangle is 
    chicken>3$8</Rectangle>
<Rectangle 1 is 
    holy>97$9</Rectangle>

And I want the output to look like this:

<Rectangle is 
    ugly>3$3</Rectangle>
<Rectangle is 
    chicken>3$8</Rectangle>

I can get matches that contain Rectangle 3, but they also contain everything before it.

<Rectangle\X*?3\$\X*?<\/Rectangle>

It seems like there should be some kind of grouping or backtracking or recursion trick to this, but I can't figure it out.

Solution

You can use a regex with negated character classes rather than lazily matching arbitrary graphemes with \X*?:

<Rectangle[^>]*>3\$[^<]*<\/Rectangle>


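Note that your \X*? matches any grapheme, including <, >, and line breaks. Although it is lazy, the engine still expands it as far as necessary to make the subsequent patterns match, so a match that starts at an earlier <Rectangle can run across several entries until it reaches 3$. With [^>]* and [^<]* you restrict the chars that the pattern can match between the fixed subpatterns, so a match cannot cross a tag boundary.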
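The overmatching can be reproduced with a small sketch. It is written in Python rather than PCRE, and since Python's built-in re module does not support \X, [\s\S]*? is used as a stand-in (an assumption: it matches any character lazily, which is close enough to show the effect). The sample text is a shortened version of the input from the question.

```python
import re

# Shortened sample input (four of the entries from the question).
text = (
    "<Rectangle is \n    good>99$1</Rectangle>\n"
    "<Rectangle is \n    ugly>3$3</Rectangle>\n"
    "<Rectangle is \n    fat>99$4</Rectangle>\n"
    "<Rectangle is \n    chicken>3$8</Rectangle>\n"
)

# Assumption: [\s\S]*? stands in for \X*? because Python's re lacks \X.
lazy = re.compile(r"<Rectangle[\s\S]*?3\$[\s\S]*?</Rectangle>")
lazy_matches = lazy.findall(text)

for m in lazy_matches:
    # Each match starts at the earliest available <Rectangle and
    # swallows the unwanted entry that precedes the 3$ entry.
    print(repr(m))
```

Each match here begins at an entry that should have been skipped, because the lazy quantifier is free to step over > and < while searching for 3$.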

Details

<Rectangle - a literal string
[^>]* - any zero or more chars other than >
>3\$ - a >3$ string
[^<]* - any zero or more chars other than <
<\/Rectangle> - a </Rectangle> string.
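As a quick check of the pattern described above, here is a minimal sketch run against the sample input from the question. It uses Python's re module instead of PCRE (an assumption made for illustration; the pattern uses no PCRE-only syntax, and the \/ escape is dropped because Python does not need it).

```python
import re

# Sample input from the question, with the line breaks inside each tag.
text = (
    "<Rectangle is \n    good>99$1</Rectangle>\n"
    "<Rectangle is \n    bad>99$2</Rectangle>\n"
    "<Rectangle is \n    ugly>3$3</Rectangle>\n"
    "<Rectangle is \n    fat>99$4</Rectangle>\n"
    "<Rectangle is \n    janky6789>99$5</Rectangle>\n"
    "<Rectangle is \n    34+35>99$6</Rectangle>\n"
    "<Rectangle is \n    <>>98$7</Rectangle>\n"
    "<Rectangle is \n    chicken>3$8</Rectangle>\n"
    "<Rectangle 1 is \n    holy>97$9</Rectangle>\n"
)

# [^>]* cannot run past the end of the opening tag, and [^<]* cannot
# run past the start of the closing tag, so each match stays in one entry.
pattern = re.compile(r"<Rectangle[^>]*>3\$[^<]*</Rectangle>")
matches = pattern.findall(text)

for m in matches:
    print(m)
```

On this input the sketch finds only the "ugly" and "chicken" entries. Note that [^>]* matches line breaks as well, so no multiline or dotall flag is needed.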