regex perl regex-recursion

Unexpected behavior around recursive regex


I am trying to match a C++ argument type, which can contain balanced < and > characters.

With this regex: (\<(?>[^<>]|(?R))*\>)

On this string: QMap<QgsFeatureId, QPair<QMap<Something, Complex> >>

It matches everything except the first 4 characters (QMap).

Now, if I add \w+ at the start of my regex, it only matches the end of the string (QPair<QMap<Something, Complex> >>) and not the whole string.

What is the explanation and how to solve this?

You can try it online here.

This is intended for use with Perl 5.10+ (5.24).


Solution

  • The (?R) construct recurses the entire pattern. When you add \w+ at the start, the recursion includes it as well, so every nested < would also have to be preceded by \w+. What you want to recurse is only the Group 1 subpattern.

    You need a subroutine call that will recurse the capturing group subpattern:

    (\w+)(<(?:[^<>]++|(?2))*>)
    

    See the regex demo

    Details

    • (\w+) - Group 1 capturing the identifier (you may change it to [a-zA-Z]\w*)
    • (<(?:[^<>]++|(?2))*>) - Group 2 (that will be recursed)
      • < - a literal <
      • (?:[^<>]++|(?2))* - either 1+ chars other than < and > (possessively, to make it faster) or (|) the whole Group 2 pattern ((?2)).
      • > - a literal >
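The breakdown above can be sketched in Perl with the /x modifier, which lets each part of the pattern carry its own comment. This is a minimal illustration rather than a definitive implementation: the variable names ($type_re, $input, $name, $args) are assumptions made for the example, and the input is the string from the question.

```perl
use strict;
use warnings;

# Same pattern as above, spread out under /x so that whitespace and
# comments inside the pattern are ignored by the regex engine.
my $type_re = qr{
    (\w+)              # Group 1: the identifier, e.g. QMap
    (                  # Group 2: one balanced block
        <              #   a literal opening angle bracket
        (?:
            [^<>]++    #   1+ chars other than angle brackets, possessive
          |
            (?2)       #   or a nested block: re-enter Group 2 only
        )*
        >              #   a literal closing angle bracket
    )
}x;

# Input string taken from the question.
my $input = 'QMap<QgsFeatureId, QPair<QMap<Something, Complex> >>';

my ($name, $args) = $input =~ $type_re;
if (defined $name) {
    print "Match:   $name$args\n";
    print "Group 1: $name\n";
    print "Group 2: $args\n";
}
```

Because (?2) re-enters only Group 2, a nested < does not need an identifier in front of it, which is the difference from (?R).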

    Results:

    Match:   QMap<QgsFeatureId, QPair<QMap<Something, Complex> >>
    Group 1: QMap
    Group 2: <QgsFeatureId, QPair<QMap<Something, Complex> >>
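As a further sketch, the same pattern can be used with /g to collect every templated type from a longer piece of text. The declaration in $decl is a made-up example (it does not come from the question), and the names $decl and @types are likewise assumptions for illustration.

```perl
use strict;
use warnings;

# Pattern from the answer: Group 1 is the identifier,
# Group 2 is the balanced <...> block, recursed via (?2).
my $type_re = qr/(\w+)(<(?:[^<>]++|(?2))*>)/;

# Hypothetical declaration, invented for this example.
my $decl = 'void f(QMap<int, QList<QString> > a, QVector<double> b)';

my @types;
while ($decl =~ /$type_re/g) {
    # Identifier plus its balanced <...> block.
    push @types, "$1$2";
}

print "$_\n" for @types;
```

Plain identifiers such as void and f are skipped because Group 2 requires a < right after the identifier. Note that (?2) refers to an absolute group number, so embedding $type_re in a larger pattern that declares capture groups before it would change which group is recursed.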