I'm trying to collect statements that describe Rectangle 3 using regex (PCRE engine). This is part of a scraping project for a proprietary TGML-ish language.
The input looks like this:
<Rectangle is
good>99$1</Rectangle>
<Rectangle is
bad>99$2</Rectangle>
<Rectangle is
ugly>3$3</Rectangle>
<Rectangle is
fat>99$4</Rectangle>
<Rectangle is
janky6789>99$5</Rectangle>
<Rectangle is
34+35>99$6</Rectangle>
<Rectangle is
<>>98$7</Rectangle>
<Rectangle is
chicken>3$8</Rectangle>
<Rectangle 1 is
holy>97$9</Rectangle>
And I want the output to look like this:
<Rectangle is
ugly>3$3</Rectangle>
<Rectangle is
chicken>3$8</Rectangle>
I can get matches that contain Rectangle 3, but they also contain everything before it.
<Rectangle\X*?3\$\X*?<\/Rectangle>
It seems like there should be some kind of grouping or backtracking or recursion trick to this, but I can't figure it out.
You can use a regex with negated character classes rather than lazily matching any graphemes with \X*?:
<Rectangle[^>]*>3\$[^<]*<\/Rectangle>
See the regex demo.
Note that your \X*? matches any grapheme, including <, >, and line breaks, so it will match as far as necessary to make the subsequent patterns match. That is why each match starts at the first <Rectangle and runs through all the elements before the one you want. With [^>]* and [^<]* you restrict the characters that the pattern can match between the fixed subpatterns.
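To make the over-matching concrete, here is a minimal sketch in Python. Python's built-in re module is not PCRE and does not support \X, so [\s\S]*? is used as a stand-in (an assumption made for illustration; it matches any character, including line breaks, which is close enough to show the effect). The names sample, lazy, and lazy_matches are hypothetical.

```python
import re

# A shortened version of the input from the question.
sample = (
    "<Rectangle is\ngood>99$1</Rectangle>\n"
    "<Rectangle is\nbad>99$2</Rectangle>\n"
    "<Rectangle is\nugly>3$3</Rectangle>\n"
    "<Rectangle is\nfat>99$4</Rectangle>\n"
)

# Stand-in for the question's pattern: [\s\S]*? replaces \X*?,
# which Python's re module does not support.
lazy = re.compile(r"<Rectangle[\s\S]*?3\$[\s\S]*?</Rectangle>")

lazy_matches = lazy.findall(sample)

# The single match starts at the FIRST <Rectangle and only stops
# after the element that contains "3$".
print(len(lazy_matches))
print(lazy_matches[0])
```

Laziness only makes the quantifier give up as early as possible once a match attempt has started; it does not move the starting position forward. The engine begins at the leftmost <Rectangle and the lazy quantifier then expands across tag boundaries until it finds 3$.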
Details
<Rectangle - a literal string
[^>]* - any zero or more chars other than >
>3\$ - a >3$ string
[^<]* - any zero or more chars other than <
<\/Rectangle> - a </Rectangle> string.
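The pattern broken down above can be tried against the input from the question. This is a minimal sketch in Python rather than PCRE; the pattern uses only syntax that both engines share, and the forward slash is left unescaped because Python does not need regex delimiters. The names sample, pattern, and matches are hypothetical.

```python
import re

# The input from the question, with tags spanning line breaks.
sample = """<Rectangle is
good>99$1</Rectangle>
<Rectangle is
bad>99$2</Rectangle>
<Rectangle is
ugly>3$3</Rectangle>
<Rectangle is
fat>99$4</Rectangle>
<Rectangle is
janky6789>99$5</Rectangle>
<Rectangle is
34+35>99$6</Rectangle>
<Rectangle is
<>>98$7</Rectangle>
<Rectangle is
chicken>3$8</Rectangle>
<Rectangle 1 is
holy>97$9</Rectangle>
"""

# [^>]* cannot cross the end of the opening tag and [^<]* cannot
# cross into the closing tag, so each match stays inside one element.
pattern = re.compile(r"<Rectangle[^>]*>3\$[^<]*</Rectangle>")

matches = pattern.findall(sample)
for m in matches:
    print(m)
```

Negated character classes match line breaks, so the opening tag is still matched even though it spans two lines, and no single-line or dot-all flag is needed.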