regex, regex-group, capturing-group

Regex recursion captured string


I have a problem with a regex that needs to capture a substring that has already been captured as part of a previous match.

I have this regex:

(?<domain>\w+\.\w+)($|\/|\.)

And I want to capture every subdomain recursively. For example, in this string:

test1.test2.abc.def

This expression captures test1.test2 and abc.def, but I need to capture test1.test2, test2.abc and abc.def.

Do you know if there is any option to do this recursively?

Thanks!


Solution

  • You can use a well-known technique for extracting overlapping matches: wrap a capturing group in a positive lookahead. However, you can't rely on \b boundaries, since \b matches at any transition between a word char and a non-word char, in either direction. You need unambiguous word boundaries for the left-hand and right-hand contexts.

    Use

    (?=(?<!\w)(?<domain>\w+\.\w+)(?!\w))
    

    See the regex demo. Details:

    • (?= - start of a positive lookahead that is tested at each location in the string and captures the text to the right of that location without consuming it
      • (?<!\w) - a left-hand side word boundary (no word char immediately to the left)
      • (?<domain>\w+\.\w+) - Group "domain": one or more word chars, a literal . and one or more word chars
      • (?!\w) - a right-hand side word boundary (no word char immediately to the right)
    • ) - end of the outer lookahead.
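
To see the overlapping matches in practice, here is a minimal Python sketch of the pattern above. Two things are assumptions of this example rather than part of the answer: Python's re module spells named groups as (?P<domain>...) instead of (?<domain>...), and the helper name overlapping_pairs is made up for illustration. Because every match of the outer lookahead is zero-width, the text has to be read from the "domain" group rather than from the match itself.

```python
import re

# Same pattern as above, with the named group rewritten in Python's
# (?P<name>...) syntax. The outer lookahead consumes no characters, so the
# regex engine can retry at every position and find overlapping pairs.
OVERLAPPING = re.compile(r"(?=(?<!\w)(?P<domain>\w+\.\w+)(?!\w))")


def overlapping_pairs(text):
    """Return every pair of adjacent dot-separated words (hypothetical helper)."""
    # Each match is empty; the captured text lives in the "domain" group.
    return [m.group("domain") for m in OVERLAPPING.finditer(text)]


print(overlapping_pairs("test1.test2.abc.def"))
# ['test1.test2', 'test2.abc', 'abc.def']
```

In regex flavors that accept (?<domain>...) directly, the pattern should work unchanged; only the way the group is read back differs.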

    Another approach is to treat dots as the only delimiters, so that any char other than a dot counts as part of a word. In that case, use

    (?=(?<![^.])(?<domain>[^.]+\.[^.]+)(?![^.]))
    

    See this regex demo. Adjust as you see fit.
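
As a rough illustration of how the two patterns differ, the Python sketch below runs the dot-delimited variant on the question's string and on a hyphenated host name. The named group again uses Python's (?P<domain>...) spelling, and both the helper name adjacent_labels and the sample input my-site.example.com are assumptions made up for this example.

```python
import re

# Dot-delimited variant: (?<![^.]) means "start of string or preceded by a
# dot", and (?![^.]) means "end of string or followed by a dot".
DOT_DELIMITED = re.compile(r"(?=(?<![^.])(?P<domain>[^.]+\.[^.]+)(?![^.]))")


def adjacent_labels(host):
    """Return every pair of adjacent dot-separated labels (hypothetical helper)."""
    return [m.group("domain") for m in DOT_DELIMITED.finditer(host)]


print(adjacent_labels("test1.test2.abc.def"))
# ['test1.test2', 'test2.abc', 'abc.def']

# A hyphen is not a word char, so the \w-based pattern would split this label,
# while the dot-delimited pattern keeps it whole.
print(adjacent_labels("my-site.example.com"))
# ['my-site.example', 'example.com']
```

Note that [^.] accepts any char except a dot, including spaces and slashes, so this variant is only a good fit when the input is already known to be a bare dot-separated name.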