Search code examples
javaregexbackreferencecapturing-group

Why does backreferencing capturing groups work for multiple digit numbers in Java?


Let's say that you have a string:

String string = "ab #1?AZa$ab #1?AZa$"

You're trying to verify that the tenth is a non-whitespace character, and that the twentieth character is the same as the tenth. Furthermore, there is corresponding verification with the 1st and 11th, the 2nd and 12th, the 3rd and 13th, etc. each with their own separate requirements (the full list is here) so you have to use 10 capturing groups. I found that the following regex still works to validate the aforementioned string:

string.matches("^([a-z])(\\w)(\\s)(\\W)(\\d)(\\D)([A-Z])([a-zA-Z])([aeiouAEIOU])(\\S)\\1\\2\\3\\4\\5\\6\\7\\8\\9\\10$") //returns true

My question regards the last backreference:

\\10

Shouldn't this be interpreted as "match with the first character" and then "match with 0" (the digit)? I don't see how this is interpreted as "match with the tenth character" without somehow grouping the 1 and 0 together into 10. Puzzlingly, surrounding the 1 and 0 with parentheses does not work.


Solution

  • The behavior for Java is documented in Pattern:

    In this class, \1 through \9 are always interpreted as back references, and a larger number is accepted as a back reference if at least that many subexpressions exist at that point in the regular expression, otherwise the parser will drop digits until the number is smaller or equal to the existing number of groups or it is one digit.