I am studying .NET regular expressions. As is well known, there is an ambiguity between octal escape codes (such as \16) and \number backreferences: https://msdn.microsoft.com/en-us/library/thwdfzxy.aspx
My question is: what do regular expressions like \19 or \288 match when the group with that number is not defined in the regex pattern? Such a sequence is neither a valid group number nor a valid octal code.
Yet it is a valid regular expression (even \14848486 is valid): the Regex constructor does not throw an ArgumentException, but I could not find any input string that matches such an escape sequence.
I am just curious how such an expression is interpreted.
Ambiguity arises when a pattern can be parsed in more than one way. In the pattern (.)(.)(.)(.)(.)(.)(.)(.)(.)(.)\10, which has 10 capturing groups, \10 is ambiguous because both 1 and 10 refer to existing groups. The .NET regex engine resolves this in favor of the largest possible group number, so this regex won't match 12345678901 but will match 12345678900. To get rid of the ambiguity, use \k<ID> backreferences: (.)(.)(.)(.)(.)(.)(.)(.)(.)(.)\k<1>0 would match 123456789010, but not 123456789000.
The \14848486 pattern matches the character with octal code 14, followed by the literal sequence 848486. No group with that number exists and 8 is not an octal digit, so there is no ambiguity here.
The \18848486 pattern matches the character with octal code 1, followed by the literal sequence 8848486. See this C# demo:
var s = Regex.Match("\u00018848486", @"\18848486");
if (s.Success) Console.WriteLine(s.Value); // => U+0001 (unprintable) followed by 8848486
I also suggest using Ultrapico Expresso (no affiliation) to debug .NET regexps.
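As for \19 and \288, the same rule applies when no group 19 or 288 is defined: \19 matches the character with octal code 1 followed by a literal 9, and \288 matches the character with octal code 2 followed by the literals 88.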
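To check how \19 and \288 behave when no groups are defined, a small sketch along these lines can be run. It assumes the default RegexOptions and that each escape falls back to an octal code followed by literal digits, in the same way as \18848486 above; the anchors ^ and $ are added only to make sure the whole input is consumed.

```csharp
using System;
using System.Text.RegularExpressions;

// No capturing groups are defined, so neither escape can be a backreference.
// Assumption: \19 is read as octal 1 (U+0001) followed by a literal 9.
bool m19 = Regex.IsMatch("\u00019", @"^\19$");

// Assumption: \288 is read as octal 2 (U+0002) followed by the literals 88.
bool m288 = Regex.IsMatch("\u000288", @"^\288$");

Console.WriteLine($"{m19} {m288}");
```

If both values print as True, the octal-fallback reading holds for these two patterns on the runtime being tested.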
Besides, when you use a backreference to a group that is missing, as in \k<1>8848486, you will get a System.ArgumentException: parsing '\k<1>8848486' - Reference to undefined group number N exception. When you have 8 or 9 right after \, as in \8848486, you will get System.ArgumentException: parsing '\8848486' - Unrecognized escape sequence N.
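The two failing patterns can be contrasted with the octal-fallback case in a short sketch. FailsToParse is a hypothetical helper introduced here for illustration; it only reports whether the Regex constructor throws an ArgumentException, without inspecting the message text, which may differ between .NET versions.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper: true if the Regex constructor rejects the pattern.
static bool FailsToParse(string pattern)
{
    try { _ = new Regex(pattern); return false; }
    catch (ArgumentException) { return true; }
}

// \k<1> refers to a group that does not exist in the pattern.
bool undefinedGroup = FailsToParse(@"\k<1>8848486");

// \8 is neither a defined group nor the start of an octal code.
bool badEscape = FailsToParse(@"\8848486");

// \1 followed by more digits falls back to an octal code, so it parses.
bool octalFallback = FailsToParse(@"\18848486");

Console.WriteLine($"{undefinedGroup} {badEscape} {octalFallback}"); // True True False
```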