java regex capturing-group

Capturing <thisPartOnly> and (thisPartOnly) with the same group


Let's say we have the following input:

<amy>
(bob)
<carol)
(dean>

We also have the following regex:

<(\w+)>|\((\w+)\)

Now we get two matches (as seen on rubular.com):

  • <amy> is a match, \1 captures amy, \2 does not participate
  • (bob) is a match, \2 captures bob, \1 does not participate
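For the java.util.regex flavor, the same behaviour can be reproduced with a minimal sketch like the one below. The class name TwoGroupDemo and the helper describe are made up for illustration; only the regex itself comes from the question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TwoGroupDemo {
    // The regex from the question, written as a Java string literal.
    static final Pattern P = Pattern.compile("<(\\w+)>|\\((\\w+)\\)");

    // Hypothetical helper: one "match -> group1, group2" entry per match.
    static List<String> describe(String input) {
        List<String> out = new ArrayList<>();
        Matcher m = P.matcher(input);
        while (m.find()) {
            // A group that did not take part in the match is reported as null.
            out.add(m.group() + " -> " + m.group(1) + ", " + m.group(2));
        }
        return out;
    }

    public static void main(String[] args) {
        // Prints:
        //   <amy> -> amy, null
        //   (bob) -> null, bob
        describe("<amy>\n(bob)\n<carol)\n(dean>").forEach(System.out::println);
    }
}
```

Note that the caller has to look at both groups to find the captured name, which is the inconvenience discussed in the question.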

This regex does most of what we want:

  • It matches the open and close brackets properly (i.e. no mixing)
  • It captures the part we're interested in

However, it does have a few drawbacks:

  • The capturing pattern (i.e. the "main" part) is repeated
    • It's only \w+ in this case, but generally speaking this can be quite complex:
      • If it involves backreferences, then they must be renumbered for each alternate!
      • Repetition makes maintenance a nightmare! (what if it changes?)
  • The groups are essentially duplicated
    • Depending on which alternate matches, we must query different groups
      • It's only \1 or \2 in this case, but generally the "main" part can have capturing groups of its own!
    • Not only is this inconvenient, but there may be situations where this is not feasible (e.g. when we're using a custom regex framework that is limited to querying only one group)
  • The situation quickly worsens if we also want to match {...}, [...], etc.

So the question is obvious: how can we do this without repeating the "main" pattern?

Note: for the most part I'm interested in the java.util.regex flavor, but other flavors are welcome.


Appendix

There's nothing new in this section; it only illustrates the problem mentioned above with an example.

Let's take the above example to the next step: we now want to match these:

<amy=amy>
(bob=bob)
[carol=carol]

But not these:

<amy=amy)   # non-matching bracket
<amy=bob>   # left hand side not equal to right hand side

Using the alternate technique, we have the following that works (as seen on rubular.com):

<((\w+)=\2)>|\(((\w+)=\4)\)|\[((\w+)=\6)\]

As explained above:

  • The main pattern can't simply be repeated; backreferences must be renumbered
  • Repetition also means a maintenance nightmare if it ever changes
  • Depending on which alternate matches, we must query either \1 \2, \3 \4, or \5 \6
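To make the last point concrete, here is a rough sketch of what the calling code looks like in Java with the alternation regex. The class name AlternationDemo and the helper firstName are hypothetical; the loop over groups 2, 4, 6 is just one possible way to find the alternate that matched.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AlternationDemo {
    // The three-alternate regex from the appendix, as a Java string literal.
    static final Pattern P = Pattern.compile(
        "<((\\w+)=\\2)>|\\(((\\w+)=\\4)\\)|\\[((\\w+)=\\6)\\]");

    // Hypothetical helper: returns the name from the first match, or null
    // if nothing matches. The name lives in group 2, 4, or 6 depending on
    // which bracket pair matched, so every candidate has to be checked.
    static String firstName(String input) {
        Matcher m = P.matcher(input);
        if (!m.find()) {
            return null;
        }
        for (int g = 2; g <= m.groupCount(); g += 2) {
            if (m.group(g) != null) {
                return m.group(g);
            }
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println(firstName("<amy=amy>"));     // amy
        System.out.println(firstName("(bob=bob)"));     // bob
        System.out.println(firstName("[carol=carol]")); // carol
        System.out.println(firstName("<amy=amy)"));     // null
        System.out.println(firstName("<amy=bob>"));     // null
    }
}
```

Every new bracket type adds two more groups, and the loop bounds and backreference numbers all depend on the alternates staying in sync.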

Solution

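  • You can use a lookahead to "lock in" the group number before doing the real match. The lookahead checks for any opening bracket and captures the "main" part once (as \1, with the name in \2) without consuming input; the alternation that follows then only has to match each bracket pair around \1.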

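    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class Main {
      public static void main(String[] args) {
        String s = "<amy=amy>(bob=bob)[carol=carol]";
        // Group 1 is the whole "main" part, group 2 the name inside it.
        Pattern p = Pattern.compile(
          "(?=[<(\\[]((\\w+)=\\2))(?:<\\1>|\\(\\1\\)|\\[\\1\\])");
        Matcher m = p.matcher(s);

        while (m.find())
        {
          System.out.printf("found %s in %s%n", m.group(2), m.group());
        }
      }
    }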
    

    output:

    found amy in <amy=amy>
    found bob in (bob=bob)
    found carol in [carol=carol]
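The run above only shows matching input. As a quick sanity check, the sketch below feeds the same lookahead pattern the two non-matching cases from the question's appendix. The class name LookaheadDemo and the helper names are invented for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LookaheadDemo {
    // The lookahead pattern from the answer, unchanged.
    static final Pattern P = Pattern.compile(
        "(?=[<(\\[]((\\w+)=\\2))(?:<\\1>|\\(\\1\\)|\\[\\1\\])");

    // Hypothetical helper: collects group 2 (the name) of every match.
    // The group number is the same whichever bracket pair matched.
    static List<String> names(String input) {
        List<String> out = new ArrayList<>();
        Matcher m = P.matcher(input);
        while (m.find()) {
            out.add(m.group(2));
        }
        return out;
    }

    public static void main(String[] args) {
        // Prints [amy, bob, carol]
        System.out.println(names("<amy=amy>(bob=bob)[carol=carol]"));
        // Mixed brackets and unequal sides are both rejected: prints []
        System.out.println(names("<amy=amy) <amy=bob>"));
    }
}
```

In the first rejected case the lookahead succeeds but no alternative finds the matching close bracket; in the second the backreference \2 inside the lookahead fails, so the alternation is never tried.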
    

    It's still ugly as hell, but you don't have to recalculate all the group numbers every time you make a change. For example, to add support for curly brackets, it's just:

    "(?=[<(\\[{]((\\w+)=\\2))(?:<\\1>|\\(\\1\\)|\\[\\1\\]|\\{\\1\\})"