regex xsd xpath-2.0

Regex character class subtraction with negative groups


This question is about character class subtraction in regular expressions (regex), specifically the regex flavour of XPath 2.0 (second edition).

When a character class subtraction contains negative groups, is the subtraction operator (-) applied before or after the negation operator (^)?

The relevant text of the XPath/XML Schema specification is quoted below, but to my mind it reads ambiguously.

For any ·positive character group· or ·negative character group· G, and any ·character class expression· C, G-C is a valid ·character class subtraction·, identifying the set of all characters in C(G) that are not also in C(C).

To be more specific, consider the following three regexes:

  1. [^abc-[ad]]
  2. [^abc-[^ad]]
  3. [abc-[^ad]]

being matched against the haystack text of:

  • abcdef

What text does each regex match (first and subsequent matches)?


Solution

  • I don't think the text is ambiguous if we read G-C as [G-[C]] and a negative group ^G as [^G]. Read that way, the caret clearly belongs to the first group only; it does not negate the result of the subtraction.

    Therefore, [^abc-[ad]] would match:

    {All Characters Besides a, b and c} \ {a and d} = { All Characters Besides a, b, c and d}

    Keep in mind that you can easily test the behavior yourself :).
    As a bonus, .NET regular expressions also support this feature, which makes it a little easier to test online.
    See also: Character Class Subtraction
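
A rough way to check all three regexes from the question under the reading above is sketched below. Python's `re` module does not support the `[G-[C]]` subtraction syntax, so this sketch assumes each subtraction can be emulated as `(?![C])[G]`: a negative lookahead rejects characters in C, then G consumes one character. The emulation, the `EMULATED` table, and the `matches` helper are illustrative assumptions, not part of the XPath or XML Schema specification, and the results have not been confirmed against an actual XPath 2.0 processor.

```python
import re

HAYSTACK = "abcdef"

# Assumed emulation: [G-[C]] is rewritten as (?![C])[G].
# The lookahead excludes the subtracted class C; the class G then
# consumes a single character. The caret stays with its own group.
EMULATED = {
    "[^abc-[ad]]": r"(?![ad])[^abc]",
    "[^abc-[^ad]]": r"(?![^ad])[^abc]",
    "[abc-[^ad]]": r"(?![^ad])[abc]",
}


def matches(pattern, text=HAYSTACK):
    """Return every non-overlapping match of pattern in text, in order."""
    return re.findall(pattern, text)


for original, emulated in EMULATED.items():
    print(f"{original:14} -> {matches(emulated)}")

# Prints:
# [^abc-[ad]]    -> ['e', 'f']
# [^abc-[^ad]]   -> ['d']
# [abc-[^ad]]    -> ['a']
```

Under this reading, the first regex matches e and then f, the second matches only d, and the third matches only a.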