swift, regex, regex-group

How to capture the "closest" group (regex)?


I think it is best to ask by example.

I'm parsing something like the following (rectangle and text definitions in PDF content):

---
---
85.039 42.52 42.519 42.52 re
W--
---
---
127.559 42.52 42.519 42.52 re
W--
---
---
170.078 42.52 42.52 42.52 re
W--
---
---
BT
---
Text
---
ET
---
---
170.078 42.52 42.52 42.52 re
W--
---
---
127.559 42.52 42.519 42.52 re
W--
---
---
BT
---
Text
---
ET
---
---
170.078 42.52 42.52 42.52 re
W--
---
---
BT
---
Text
---
ET
---
---

The dashes are placeholders only; in the real data they can be anything (control characters, numbers, matrices, and so on).

Currently I'm capturing these groups:

# Clipping Rectangle
(?<x>\b[-0-9\.]+\b)(\s)
(?<y>\b[-0-9\.]+\b)(\s)
(?<width>\b[-0-9\.]+\b)(\s)
(?<height>\b[-0-9\.]+\b)(\s)
(re\nW)

(.*?)

# Text
(BT)
(?<text>.*?)
(ET)

But in these matches, the first (furthest) clipping rectangle is captured instead of the last (closest).
How can I capture the clipping rectangle closest to each text block using regex?



UPDATE: See on Regex101.


Solution

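  • A regex engine returns the leftmost possible match, so the lazy .*? will skip over any later rectangles as long as a BT eventually follows the first one. You could add a negative lookahead for the four numbers followed by re\nW after each character consumed by the .*? gap, so that another clipping rectangle cannot occur between the captured rectangle and the text: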

    # Clipping Rectangle
    (?<x>\b[-0-9\.]+\b)(\s)
    (?<y>\b[-0-9\.]+\b)(\s)
    (?<width>\b[-0-9\.]+\b)(\s)
    (?<height>\b[-0-9\.]+\b)(\s)
    (re\nW)
    
    ((?:.(?!(\b[-\d.]+\b\s){4}re\nW))*?)
    
    # Text
    (BT)
    (?<text>.*?)
    (ET)
    

    Demo on regex101
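
Since the question is tagged swift, here is a minimal sketch of how the pattern above might be applied with Foundation's NSRegularExpression. Several things here are assumptions rather than part of the original answer: the type `ClippedText`, the function `closestRectangles(in:)`, and the sample input are hypothetical; the two options are assumed to mirror the flags used in the regex101 demo (comments and whitespace ignored, dot matching newlines); and the unnamed capture groups are dropped because only the named ones are read.

```swift
import Foundation

// Pattern from the answer, in a raw string so backslashes reach the regex
// engine unchanged. Comments and whitespace inside it are ignored because
// of .allowCommentsAndWhitespace below.
let pattern = #"""
# Clipping rectangle: four numbers, then "re", newline, "W"
(?<x>\b[-0-9.]+\b)\s
(?<y>\b[-0-9.]+\b)\s
(?<width>\b[-0-9.]+\b)\s
(?<height>\b[-0-9.]+\b)\s
re\nW
# Gap that may not run into the start of another clipping rectangle
(?:.(?!(?:\b[-\d.]+\b\s){4}re\nW))*?
# Text
BT(?<text>.*?)ET
"""#

// Hypothetical container for one rectangle/text pair.
struct ClippedText {
    let x: String, y: String, width: String, height: String
    let text: String
}

// Hypothetical helper: returns each text block with its closest rectangle.
func closestRectangles(in content: String) throws -> [ClippedText] {
    let regex = try NSRegularExpression(
        pattern: pattern,
        options: [.allowCommentsAndWhitespace, .dotMatchesLineSeparators]
    )
    let whole = NSRange(content.startIndex..., in: content)
    return regex.matches(in: content, options: [], range: whole).map { match -> ClippedText in
        // Look up a named group and convert its NSRange back to a String.
        func group(_ name: String) -> String {
            guard let r = Range(match.range(withName: name), in: content) else { return "" }
            return String(content[r])
        }
        return ClippedText(x: group("x"), y: group("y"), width: group("width"),
                           height: group("height"), text: group("text"))
    }
}

// Made-up sample in the shape of the question's data; "---" is filler.
let sample = """
---
85.039 42.52 42.519 42.52 re
W--
---
127.559 42.52 42.519 42.52 re
W--
---
BT
---
Text A
---
ET
---
127.559 42.52 42.519 42.52 re
W--
---
170.078 42.52 42.52 42.52 re
W--
---
BT
---
Text B
---
ET
---
"""

for item in try closestRectangles(in: sample) {
    print(item.x, item.y, item.width, item.height)
}
```

With this sample, the first text block should pair with the rectangle starting at 127.559 and the second with the one starting at 170.078, because any attempt that starts at an earlier rectangle fails as soon as the gap would have to step onto the next one. The lookahead is evaluated after every character of the gap, so on large content streams this may be noticeably slower than the original pattern.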