I think this is best asked by example.
I'm parsing something like the following (rectangle and text definitions in a PDF content):
---
---
85.039 42.52 42.519 42.52 re
W--
---
---
127.559 42.52 42.519 42.52 re
W--
---
---
170.078 42.52 42.52 42.52 re
W--
---
---
BT
---
Text
---
ET
---
---
170.078 42.52 42.52 42.52 re
W--
---
---
127.559 42.52 42.519 42.52 re
W--
---
---
BT
---
Text
---
ET
---
---
170.078 42.52 42.52 42.52 re
W--
---
---
BT
---
Text
---
ET
---
---
The dashes are for illustration only; in the real data they can be anything (various control characters, numerics, matrices, and so on).
Currently I'm capturing these groups:
# Clipping Rectangle
(?<x>\b[-0-9\.]+\b)(\s)
(?<y>\b[-0-9\.]+\b)(\s)
(?<width>\b[-0-9\.]+\b)(\s)
(?<height>\b[-0-9\.]+\b)(\s)
(re\nW)
(.*?)
# Text
(BT)
(?<text>.*?)
(ET)
But with this pattern, each match captures the first (furthest) clipping rectangle instead of the last (closest) one.
How can I capture the closest clipping groups to the text groups using Regex?
UPDATE: See on Regex101.
You could add a negative lookahead for the four numbers followed by re\nW after each character matched by the .*? part, so that the rectangle pattern cannot occur again between the captured rectangle and the text:
# Clipping Rectangle
(?<x>\b[-0-9\.]+\b)(\s)
(?<y>\b[-0-9\.]+\b)(\s)
(?<width>\b[-0-9\.]+\b)(\s)
(?<height>\b[-0-9\.]+\b)(\s)
(re\nW)
((?:.(?!(\b[-\d.]+\b\s){4}re\nW))*?)
# Text
(BT)
(?<text>.*?)
(ET)
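
To check the behaviour outside Regex101, here is a minimal sketch in Python. It assumes the pattern is meant to run in extended (x) and single-line (s) mode, which correspond to re.VERBOSE and re.DOTALL; Python also spells named groups as (?P<name>...). The names SAMPLE and closest_clips are made up for this illustration, and the unnamed capture groups are dropped for brevity.

```python
import re

# Hypothetical sample, shaped like the data in the question.
SAMPLE = """---
85.039 42.52 42.519 42.52 re
W--
---
127.559 42.52 42.519 42.52 re
W--
---
BT
---
Text A
---
ET
---
170.078 42.52 42.52 42.52 re
W--
---
BT
---
Text B
---
ET
"""

PATTERN = re.compile(
    r"""
    # Clipping rectangle
    (?P<x>\b[-0-9.]+\b)\s
    (?P<y>\b[-0-9.]+\b)\s
    (?P<width>\b[-0-9.]+\b)\s
    (?P<height>\b[-0-9.]+\b)\s
    re\nW
    # Any characters, as long as no further rectangle starts after them
    (?:.(?!(?:\b[-\d.]+\b\s){4}re\nW))*?
    # Text
    BT
    (?P<text>.*?)
    ET
    """,
    re.VERBOSE | re.DOTALL,
)


def closest_clips(content):
    """Return (x, y, width, height, text) for each text block,
    paired with the nearest preceding clipping rectangle."""
    return [
        (m["x"], m["y"], m["width"], m["height"], m["text"].strip())
        for m in PATTERN.finditer(content)
    ]


if __name__ == "__main__":
    for clip in closest_clips(SAMPLE):
        print(clip)
```

With this sample, the first text block should be paired with the 127.559 rectangle rather than the 85.039 one, because the lookahead stops the filler from running across another rectangle definition.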