.net regex octal backreference

What does the .NET Regex "\19abc" match?


I am studying .NET regular expressions. As is well known, there is an ambiguity between octal escape codes (such as \16) and \number backreferences: https://msdn.microsoft.com/en-us/library/thwdfzxy.aspx

My question is: what do regular expressions like \19 or \288 match when no group with that number is defined in the pattern?

Such a sequence is neither a valid group number nor a valid octal code. Yet it is a valid regular expression (even \14848486 is valid): the Regex constructor does not throw an ArgumentException, but I could not find any input string that matches such an escape sequence.

I am just curious how such an expression is interpreted.


Solution

  • Ambiguity arises when a pattern can be parsed in more than one way. Take (.)(.)(.)(.)(.)(.)(.)(.)(.)(.)\10, a pattern with 10 capturing groups: \10 could mean group 10, or group 1 followed by a literal 0, since both groups exist. The .NET regex engine resolves this in favor of the largest group number, so this regex won't match 12345678901, but will match 12345678900. To get rid of the ambiguity, use \k<ID> backreferences: (.)(.)(.)(.)(.)(.)(.)(.)(.)(.)\k<1>0 would match 123456789010, but not 123456789000.
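
    The behavior described above can be checked with a minimal sketch like the one below. The class and method names (BackrefDemo, NumberedMatches, ExplicitMatches) are made up for illustration; only Regex.IsMatch comes from the framework.

```csharp
using System;
using System.Text.RegularExpressions;

static class BackrefDemo
{
    // Ten capturing groups, so both group 1 and group 10 exist.
    const string TenGroups = "(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)";

    // \10 is read as a reference to group 10, not group 1 followed by "0".
    public static bool NumberedMatches(string input) =>
        Regex.IsMatch(input, TenGroups + @"\10");

    // \k<1>0 is unambiguous: group 1, then a literal "0".
    public static bool ExplicitMatches(string input) =>
        Regex.IsMatch(input, TenGroups + @"\k<1>0");

    static void Main()
    {
        Console.WriteLine(NumberedMatches("12345678900"));  // True: group 10 captured "0"
        Console.WriteLine(NumberedMatches("12345678901"));  // False
        Console.WriteLine(ExplicitMatches("123456789010")); // True: group 1 captured "1"
        Console.WriteLine(ExplicitMatches("123456789000")); // False
    }
}
```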

    The \14848486 pattern matches the character with octal code 14 (U+000C, form feed) followed by the literal sequence 848486. There is no ambiguity here, since no group with that number exists.

    The \18848486 pattern matches the character with octal code 1 (U+0001) followed by the literal sequence 8848486, because 8 is not an octal digit. See this C# demo:

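    // Needs: using System; using System.Text.RegularExpressions;
    // No group 18848486 exists, so \1 falls back to the octal escape for U+0001.
    var s = Regex.Match("\u00018848486", @"\18848486");
    if (s.Success) Console.WriteLine(s.Value); // => U+0001 (not visible) followed by 8848486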
    

    I also suggest using Ultrapico Expresso (no affiliation) to debug .NET regexps:

    [Screenshot: the pattern analyzed in Expresso]

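    As for \19 and \288, the same rule applies: neither 19 nor 288 is a defined group, so \19 matches the U+0001 character followed by a literal 9, and \288 matches the U+0002 character followed by a literal 88.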
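
    To check this reading against the patterns from the question, here is a small sketch. It assumes no capturing groups are defined in the pattern; OctalFallbackDemo and MatchesWhole are hypothetical names used only for this example.

```csharp
using System;
using System.Text.RegularExpressions;

static class OctalFallbackDemo
{
    // Returns true when the first match of the pattern covers the whole input.
    public static bool MatchesWhole(string pattern, string input)
    {
        Match m = Regex.Match(input, pattern);
        return m.Success && m.Value == input;
    }

    static void Main()
    {
        // \19 is octal 1 (U+0001) followed by a literal "9".
        Console.WriteLine(MatchesWhole(@"\19", "\u00019"));           // True
        // \288 is octal 2 (U+0002) followed by a literal "88".
        Console.WriteLine(MatchesWhole(@"\288", "\u000288"));         // True
        // \14848486 is octal 14 (U+000C) followed by "848486".
        Console.WriteLine(MatchesWhole(@"\14848486", "\u000C848486")); // True
        // The plain digits alone are not matched.
        Console.WriteLine(MatchesWhole(@"\19", "19"));                // False
    }
}
```

    This is also why such patterns seem to match nothing: the input has to contain the control character itself, not the digits typed after the backslash.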

    Besides, when you use a \k<ID> backreference to a group that does not exist, as in \k<1>8848486, you will get a System.ArgumentException: parsing '\k<1>8848486' - Reference to undefined group number N exception. When the first digit after \ is 8 or 9, as in \8848486, there is no octal fallback and you will get a System.ArgumentException: parsing '\8848486' - Unrecognized escape sequence N exception.
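
    The two failure cases can be told apart from the octal fallback with a sketch like this one. InvalidEscapeDemo and Rejects are hypothetical names; the sketch only checks that the Regex constructor throws an ArgumentException and does not depend on the exact message text.

```csharp
using System;
using System.Text.RegularExpressions;

static class InvalidEscapeDemo
{
    // Returns true if the Regex constructor rejects the pattern.
    public static bool Rejects(string pattern)
    {
        try
        {
            new Regex(pattern);
            return false;
        }
        catch (ArgumentException)
        {
            return true;
        }
    }

    static void Main()
    {
        Console.WriteLine(Rejects(@"\k<1>8848486")); // True: group 1 is undefined
        Console.WriteLine(Rejects(@"\8848486"));     // True: 8 cannot start an octal escape
        Console.WriteLine(Rejects(@"\18848486"));    // False: parsed as octal 1 plus literal digits
    }
}
```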