java regex capturing-group

Java regex: how to back-reference capturing groups in a certain context when their number is not known in advance


As an introductory note, I am aware of the old saying about solving problems with regex, and I am also aware of the warnings against processing XML with regex. But please bear with me for a moment...

I am trying to do a regex search and replace on a group of characters. I don't know in advance how often this group will be matched, and I want to search within a certain context only.

An example: if I have the string "abdfabsdfabfdsaabbb" and I want to search for "ab" and replace it with "@ab@", this works fine using the following regex:

Search regex:

(.*?)(ab)(.*?)

Replace:

$1@$2@$3

I get four matches in total, as expected. Within each match, the group IDs are the same, so the back-references ($1, $2 ...) work fine, too.
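
For reference, the context-free replacement can be reproduced in Java roughly as follows. This is only a minimal sketch; the class and method names (PlainReplaceDemo, markAll) are made up for illustration.

```java
class PlainReplaceDemo {
    // Wraps every "ab" in '@' signs using the search/replace pair from the question.
    static String markAll(String input) {
        // Groups 1 and 3 are lazy, so each match covers the text since the
        // previous match (group 1), one "ab" (group 2) and an empty group 3.
        return input.replaceAll("(.*?)(ab)(.*?)", "$1@$2@$3");
    }

    public static void main(String[] args) {
        // prints @ab@df@ab@sdf@ab@fdsa@ab@bb
        System.out.println(markAll("abdfabsdfabfdsaabbb"));
    }
}
```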

However, if I now add a certain context to the string, the regex above fails:

Search string:

<context>abdfabsdfabfdsaabbb</context>

Search regex:

<context>(.*?)(ab)(.*?)</context>

This finds only the first match. Wrapping the original regex in a repeated non-capturing group does not help either ("<context>(?:(.*?)(ab)(.*?))*</context>").
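
The single match can be observed with a small Java check. This is a sketch under the assumption that the regex is run with Matcher.find(); the names ContextSingleMatchDemo and thirdGroups are hypothetical. The lazy third group has to stretch all the way to the closing tag, so it swallows the remaining occurrences of "ab".

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ContextSingleMatchDemo {
    // Returns the content of group 3 for every match of the context regex.
    static List<String> thirdGroups(String input) {
        Matcher m = Pattern.compile("<context>(.*?)(ab)(.*?)</context>").matcher(input);
        List<String> groups = new ArrayList<>();
        while (m.find()) {
            groups.add(m.group(3));
        }
        return groups;
    }

    public static void main(String[] args) {
        // prints [dfabsdfabfdsaabbb]: one match, and group 3 holds the other "ab"s
        System.out.println(thirdGroups("<context>abdfabsdfabfdsaabbb</context>"));
    }
}
```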

What I would like is a list of matches as in the first search (without the context), whereby within each match the group IDs are the same.

Any idea how this could be achieved?


Solution

    Your requirement is similar to the one in this question: match and capture multiple instances of a pattern between a prefix and a suffix. Using the method described in this answer of mine:

    (?s)(?:<context>|(?!^)\G)(?:(?!</context>|ab).)*ab
    

    Add capturing groups as needed.
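
One possible way to add the groups for the "@ab@" replacement from the question is sketched below. The placement of the two groups is my own choice for illustration, and the names ContextReplaceDemo and markInContext are hypothetical: group 1 captures everything before the hit, group 2 the hit itself, so every match uses the same group numbers.

```java
import java.util.regex.Pattern;

class ContextReplaceDemo {
    // Group 1: "<context>" or the text skipped since the previous match.
    // Group 2: the "ab" to be wrapped.
    private static final Pattern IN_CONTEXT = Pattern.compile(
            "(?s)((?:<context>|(?!^)\\G)(?:(?!</context>|ab).)*)(ab)");

    static String markInContext(String input) {
        return IN_CONTEXT.matcher(input).replaceAll("$1@$2@");
    }

    public static void main(String[] args) {
        // The "ab" before and after the <context> element is left alone.
        // prints ab<context>@ab@df@ab@sdf@ab@fdsa@ab@bb</context>ab
        System.out.println(markInContext("ab<context>abdfabsdfabfdsaabbb</context>ab"));
    }
}
```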

    Caveat

    Note that the regex only works for tags that contain nothing but text. If a tag contains other tags, it won't work correctly.

    It also matches ab inside a <context> tag that has no closing </context> tag. If you want to prevent this, use:

    (?s)(?:<context>(?=.*?</context>)|(?!^)\G)(?:(?!</context>|ab).)*ab
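
The difference between the two variants can be checked with a short Java sketch. The capturing groups, the "@ab@" replacement, and the names UnclosedContextDemo, markLoose and markStrict are assumptions added for illustration only.

```java
import java.util.regex.Pattern;

class UnclosedContextDemo {
    // Variant without the lookahead: also matches inside an unclosed <context>.
    private static final Pattern LOOSE = Pattern.compile(
            "(?s)((?:<context>|(?!^)\\G)(?:(?!</context>|ab).)*)(ab)");

    // Variant with the lookahead: <context> must be followed by </context> somewhere.
    private static final Pattern STRICT = Pattern.compile(
            "(?s)((?:<context>(?=.*?</context>)|(?!^)\\G)(?:(?!</context>|ab).)*)(ab)");

    static String markLoose(String input) {
        return LOOSE.matcher(input).replaceAll("$1@$2@");
    }

    static String markStrict(String input) {
        return STRICT.matcher(input).replaceAll("$1@$2@");
    }

    public static void main(String[] args) {
        String unclosed = "<context>ab ab";
        System.out.println(markLoose(unclosed));   // prints <context>@ab@ @ab@
        System.out.println(markStrict(unclosed));  // prints <context>ab ab
        System.out.println(markStrict("<context>ab ab</context>"));
    }
}
```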
    

    Explanation

    Let us break down the regex:

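    (?s)                        # Make . match any character, including line terminators
    (?:
      <context>                 # Either enter a new <context> tag
        |
      (?!^)\G                   # or resume where the previous match ended
    )
    (?:(?!</context>|ab).)*     # Skip text that is neither ab nor the closing tag
    ab                          # The sub-pattern we are looking for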
    

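    (?:<context>|(?!^)\G) makes sure that we either get inside a new <context> tag, or continue from the end of the previous match and attempt to match more instances of the sub-pattern. The (?!^) is needed because \G also matches at the very start of the input, before any match has been made.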

    (?:(?!</context>|ab).)* matches whatever text we don't care about (anything that is not ab) and prevents us from going past the closing tag </context>. Then we match the pattern we want, ab, at the end.
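
To see the \G continuation at work, the matches can be listed one by one with Matcher.find(). The sketch below is illustrative only: the capturing group around ab and the names ContextFindDemo and hitOffsets are my additions. It records where each ab inside a <context> element starts, and shows that the chain of matches stops at each closing tag and restarts at the next opening tag.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ContextFindDemo {
    private static final Pattern IN_CONTEXT = Pattern.compile(
            "(?s)(?:<context>|(?!^)\\G)(?:(?!</context>|ab).)*(ab)");

    // Returns the start offset of every "ab" found inside a <context> element.
    static List<Integer> hitOffsets(String input) {
        Matcher m = IN_CONTEXT.matcher(input);
        List<Integer> offsets = new ArrayList<>();
        while (m.find()) {
            // Group 1 is the "ab" in every match, so the group number never changes.
            offsets.add(m.start(1));
        }
        return offsets;
    }

    public static void main(String[] args) {
        // The "ab" at offsets 0 and 27 lies outside any <context> and is skipped.
        // prints [11, 15, 39]
        System.out.println(hitOffsets("ab<context>abdfab</context>ab<context>xab</context>"));
    }
}
```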