c# .net regex ecma

Why does .NET Regex with the ECMAScript option support \A?


I have a .NET Standard 2.1 C# application that needs to run regular expressions in the ECMAScript flavor.

According to MSDN documentation, I can use RegexOptions.ECMAScript:

Enables ECMAScript-compliant behavior for the expression.

I know that the \A anchor is not supported in ECMAScript (according to link, and confirmed when I tried Regex101 with the ECMAScript option). But it seems that .NET does support it. Example:

using System;
using System.Text.RegularExpressions;

// \A anchors the match at the very start of the input.
Regex ecmaRegex = new Regex(@"\A\d{3}", RegexOptions.ECMAScript);
var matches = ecmaRegex.Matches("901-333-");

Console.WriteLine($"number of matches: {matches.Count}"); // number of matches: 1
Console.WriteLine($"The match: {matches[0]}"); // The match: 901

I expected to get no matches at all. What am I missing?


Solution

  • The answer is further down in the "ECMAScript Matching Behavior" article.

    This option does NOT redefine the meaning of the .NET-specific anchors; they are still supported.
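
A minimal sketch of that point, assuming a console project that allows top-level statements (C# 9 or later). The variable names are illustrative only. It checks that the .NET-specific anchors \A and \z still constrain the match position when RegexOptions.ECMAScript is set:

```csharp
using System;
using System.Text.RegularExpressions;

// Both patterns use .NET-specific anchors together with the ECMAScript option.
var startAnchored = new Regex(@"\A\d{3}", RegexOptions.ECMAScript);
var endAnchored = new Regex(@"\d{3}\z", RegexOptions.ECMAScript);

bool startHit = startAnchored.IsMatch("901-333-");   // three digits at the very start
bool startMiss = startAnchored.IsMatch("x901-333-"); // digits exist, but not at the start
bool endHit = endAnchored.IsMatch("call 901");       // three digits at the very end
bool endMiss = endAnchored.IsMatch("901-");          // digits exist, but not at the end

Console.WriteLine($"\\A at start: {startHit}, \\A elsewhere: {startMiss}");
Console.WriteLine($"\\z at end: {endHit}, \\z elsewhere: {endMiss}");
```

If a pattern has to be rejected whenever it uses constructs that a real ECMAScript engine lacks, that check would need to happen outside of Regex, since the option itself does not perform it.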

    The behavior of ECMAScript and canonical regular expressions differs in three areas: character class syntax, self-referencing capturing groups, and octal versus backreference interpretation.

    Character class syntax. Because canonical regular expressions support Unicode whereas ECMAScript does not, character classes in ECMAScript have a more limited syntax, and some character class language elements have a different meaning. For example, ECMAScript does not support language elements such as the Unicode category or block elements \p and \P. Similarly, the \w element, which matches a word character, is equivalent to the [a-zA-Z_0-9] character class when using ECMAScript and [\p{Ll}\p{Lu}\p{Lt}\p{Lo}\p{Nd}\p{Pc}\p{Lm}] when using canonical behavior. For more information, see Character Classes.

    Self-referencing capturing groups. A regular expression capture class with a backreference to itself must be updated with each capture iteration.

    Resolution of ambiguities between octal escapes and backreferences.

    Regular expression: \0 followed by 0 to 2 octal digits
      Canonical behavior: Interpret as an octal. For example, \044 is always interpreted as an octal value and means "$".
      ECMAScript behavior: Same behavior.

    Regular expression: \ followed by a digit from 1 to 9, followed by no additional decimal digits
      Canonical behavior: Interpret as a backreference. For example, \9 always means backreference 9, even if a ninth capturing group does not exist. If the capturing group does not exist, the regular expression parser throws an ArgumentException.
      ECMAScript behavior: If a single decimal digit capturing group exists, backreference to that digit. Otherwise, interpret the value as a literal.

    Regular expression: \ followed by a digit from 1 to 9, followed by additional decimal digits
      Canonical behavior: Interpret the digits as a decimal value. If that capturing group exists, interpret the expression as a backreference. Otherwise, interpret the leading octal digits up to octal 377; that is, consider only the low 8 bits of the value. Interpret the remaining digits as literals. For example, in the expression \3000, if capturing group 300 exists, interpret as backreference 300; if capturing group 300 does not exist, interpret as octal 300 followed by 0.
      ECMAScript behavior: Interpret as a backreference by converting as many digits as possible to a decimal value that can refer to a capture. If no digits can be converted, interpret as an octal by using the leading octal digits up to octal 377; interpret the remaining digits as literals.
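
Two of the documented differences can be probed with a short sketch, again assuming top-level statements are available; the variable names are made up for illustration. It compares \w on a non-ASCII letter in both modes, then compiles \9 in a pattern that has no ninth capturing group:

```csharp
using System;
using System.Text.RegularExpressions;

// \w is Unicode-aware in canonical mode and limited to [a-zA-Z_0-9] under ECMAScript.
string eAcute = "\u00E9"; // a lowercase letter outside ASCII
bool canonicalWord = Regex.IsMatch(eAcute, @"^\w$");
bool ecmaWord = Regex.IsMatch(eAcute, @"^\w$", RegexOptions.ECMAScript);

// \9 without a ninth group: canonical parsing rejects the pattern.
bool canonicalThrows = false;
try
{
    _ = new Regex(@"\9");
}
catch (ArgumentException)
{
    canonicalThrows = true;
}

// Under ECMAScript the same pattern compiles; per the table, \9 is then a literal "9".
bool ecmaLiteral = Regex.IsMatch("9", @"\9", RegexOptions.ECMAScript);

Console.WriteLine($"\\w matches U+00E9: canonical={canonicalWord}, ECMAScript={ecmaWord}");
Console.WriteLine($"\\9 with no group 9: canonical throws={canonicalThrows}, ECMAScript literal match={ecmaLiteral}");
```

None of these differences involve anchors, which is consistent with \A behaving the same way in both modes.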