regex, regex-group, boost-regex

Non-capturing subroutines


I was wondering if it was possible to call a subroutine but not capture the result of that call.

For instance, let's say I want to recursively match and capture a balanced bracket {} structure like

{dfsdf{sdfdf{ {dfsdf} }}dfsf}

I could use this regex:

(^(?'nest'\{(?>[^{}]|(?&nest))*\}))

The first group is what I want to capture.

However my definition of 'nest':

(?'nest' ... )

and my recursive call to the 'nest' subroutine:

(?&nest)

are also capturing groups. I would like to make my regex more efficient and save space by not capturing those groups. Is there any way to do this?
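For readers unfamiliar with regex subroutines, the recursion that 'nest' performs can be sketched as an ordinary recursive function. This is only an illustration of the control flow, not how any regex engine implements it. The name match_nest is hypothetical, and the code uses the standard library only, not boost::regex.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper mirroring the 'nest' subroutine:
//   \{ (?> [^{}] | (?&nest) )* \}
// Returns the index one past the closing brace of the balanced group that
// starts at `pos`, or std::string::npos if no balanced group starts there.
std::size_t match_nest(const std::string& s, std::size_t pos) {
    if (pos >= s.size() || s[pos] != '{') {
        return std::string::npos;               // \{ did not match
    }
    ++pos;
    while (pos < s.size()) {
        if (s[pos] == '}') {
            return pos + 1;                     // \} closes this level
        }
        if (s[pos] == '{') {
            pos = match_nest(s, pos);           // (?&nest): recurse one level down
            if (pos == std::string::npos) {
                return std::string::npos;       // inner group was unbalanced
            }
        } else {
            ++pos;                              // [^{}]: consume one ordinary character
        }
    }
    return std::string::npos;                   // input ended before the closing brace
}
```

Calling match_nest("{dfsdf{sdfdf{ {dfsdf} }}dfsf}", 0) consumes the whole sample string. Nothing is "captured" along the way: the function only reports where the balanced group ends, which is the behaviour being asked for from the subroutine call.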

edit: I expect it's impossible to not capture a subroutine definition, since its pattern needs to be captured for use elsewhere.


edit2:

I'm testing this regex with boost::regex as well as Notepad++ regex. They actually appear to define different capturing groups, which is odd to me. I'm under the impression that they both use Perl regex syntax by default.

Anyway, upon asking the question, I had the regex:

^\w+\s+[^\s]+\s+(?'header'(?'nest'\{(?>[^{}]|(?&nest))*\}))(?>\s+[^\s]+){5}\s+(?'data'(?>\{(?>[^{}]|(?&nest))*\}))\s+(?'class'(?>\{(?>[^{}]|(?&nest))*\}))

which I later realized contained needless characters that 'nest' already encapsulated. And I now have:

^\w+\s+[^\s]+\s+(?'nest'\{(?>[^{}]|(?&nest))*\})(?>\s+[^\s]+){5}\s+((?&nest))\s+((?&nest))

Notepad++ provides me with 3 capture groups when I do a replace statement

\\1: \1 \n \\2: \2 \n 3: \3 \n 4: \4

It tells me that "1 occurrence was replaced, next occurrence not found". The replacement has no text after the 4:, making me believe that the 4th capture group doesn't exist.

HOWEVER, boost::regex_match fills a match object with 6 positions:

0: metadata on the match

1: the entire match

2: the entire match

3: group1 from notepad++

4: group2 from notepad++

5: group3 from notepad++

I'm still trying to make sense of positions 1 and 2.


edit3

I misunderstood yet another piece of the puzzle...

boost::cmatch.m_subs[i] != boost::cmatch[i]

I thought that they were equal. After some more debugging, it turns out that indexing into the object works exactly like the documentation says. But I incorrectly assumed that the object would contain a structure that mirrored what boost::cmatch[i] returned. It appears that boost::cmatch[i] first removes all entries from m_subs that have matched == false. The remaining entries line up with what boost::cmatch[i] returns.
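The public indexing can be illustrated with a small sketch. As an assumption, it uses std::regex and std::smatch from the standard library as a stand-in for boost::regex and boost::cmatch, so it says nothing about Boost's internal m_subs member. The helper name matched_flags is hypothetical. It records the matched flag of every entry reachable through operator[].

```cpp
#include <cstddef>
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: returns the `matched` flag of m[0] .. m[size-1]
// for a full match of `text` against `pattern`, or an empty vector if
// the text does not match at all.
std::vector<bool> matched_flags(const std::string& pattern, const std::string& text) {
    std::regex re(pattern);
    std::smatch m;
    std::vector<bool> flags;
    if (std::regex_match(text, m, re)) {
        // m[0] is the whole match; m[i] is capture group i.
        for (std::size_t i = 0; i < m.size(); ++i) {
            flags.push_back(m[i].matched);
        }
    }
    return flags;
}
```

For example, matched_flags("(a)|(b)", "b") yields three entries: the whole match, group 1 (which did not participate, so its flag is false), and group 2. In the standard library, at least, a group that did not participate keeps its index in the public interface.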


Solution

  • Any subroutine placed into a (?(DEFINE)...) construct won't capture anything.

    If you just want to avoid having any captures, it's done like this

    https://regex101.com/r/aT4TlM/1

    Note the description of the construct:

    Subpattern definition construct (?(DEFINE)(?'nest'\{(?>[^{}]|(?&nest))*\}))
    May only be used to define functions. No matching is done in this group.

    ^(?&nest)(?(DEFINE)(?'nest'\{(?>[^{}]|(?&nest))*\}))

    Since your pattern starts with the beginning-of-string anchor ^, this is the only way.
    (?R) recurses into the whole pattern, anchor included, so it is not an option here.

    Expanded

     ^ 
     (?&nest) 
    
     (?(DEFINE)
    
          (?'nest'                      # (1 start)
               \{
               (?>
                    [^{}] 
                 |  (?&nest) 
               )*
               \}
          )                             # (1 end)
     )
    

    Output

      **  Grp 0        -  ( pos 0 , len 29 ) 
     {dfsdf{sdfdf{ {dfsdf} }}dfsf}  
      **  Grp 1 [nest] -  NULL 
    

    Metrics

    ----------------------------------
     * Format Metrics
    ----------------------------------
    Atomic Groups       =   1
    
    Capture Groups      =   1
           Named        =   1
    
    Recursions          =   2
    
    Conditionals        =   1
           DEFINE       =   1
    
    Character Classes   =   1