java regex perl control-characters

Java regular expression \cx (control characters)


The Javadoc for java.util.regex.Pattern says \cx represents "The control character corresponding to x". So I thought Pattern.compile() would reject a \c followed by any character other than [@-_], but it doesn't!

As @tchrist commented on one of the answers to "What is a regular expression for control characters?", the range is not checked at all. I tested a few characters from higher blocks and from the astral planes, and it looks like it merely flips the 7th lowest bit (0x40) of the code point value.
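
A minimal sketch of how such a test might look. The class and method names (ControlEscapeProbe, flipped, matchesFlipped) are made up for illustration, and the "XOR with 0x40" rule is only the hypothesis being probed, not documented behaviour. The program prints what the running JDK actually does rather than assuming a result.

```java
import java.util.regex.Pattern;

final class ControlEscapeProbe {
    // Hypothesis under test: \cx matches the code point of x with bit 0x40 flipped.
    static int flipped(int codePoint) {
        return codePoint ^ 0x40;
    }

    // True if the pattern \cx matches the single code point predicted by the hypothesis.
    static boolean matchesFlipped(int x) {
        String pattern = "\\c" + new String(Character.toChars(x));
        String candidate = new String(Character.toChars(flipped(x)));
        return Pattern.compile(pattern).matcher(candidate).matches();
    }

    public static void main(String[] args) {
        // Two documented cases, then characters outside [@-_], a Latin-1 letter,
        // and an astral code point.
        int[] probes = {'A', '_', '!', 'a', 0x00E9, 0x1D11E};
        for (int x : probes) {
            System.out.printf("\\c + U+%04X matches U+%04X: %b%n",
                    x, flipped(x), matchesFlipped(x));
        }
    }
}
```

If every line ends in true, the range is indeed not being checked on that JDK.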

So is it a Javadoc bug or an implementation bug, or am I misunderstanding something? Is \cx a Java-invented syntax, or is it supported by other regex engines, especially Perl? How is it handled there?


Solution

  • All versions of Perl behave the same for the following escapes:

    • When \c is followed by an ASCII uppercase letter or one of @[\]^_?,

      chr(ord($char) ^ 0x40)

      This provides full coverage of all ASCII control characters (0x00..0x1F, 0x7F).

      \c@ === \x00
      \cA === \x01
      ...
      \cZ === \x1A
      \c[ === \x1B
      \c\ === \x1C   # Sometimes \c\\ is needed.
      \c] === \x1D
      \c^ === \x1E
      \c_ === \x1F
      \c? === \x7F
      
    • When \c is followed by an ASCII lowercase letter,

      chr(ord($char) ^ 0x60)

      This makes the escape case-insensitive.

      \ca === \cA === \x01
      ...
      \cz === \cZ === \x1A
      

    No other sequences make sense, but error checking was only introduced in Perl 5.20.

    • ≥5.20,

      • When \c is followed by a space, an ASCII digit or one of !"#$%&'()*+,-./:;<=>{|}~,

        chr(ord($char) ^ 0x40), but warns ("\c%c" is more clearly written simply as "%s").

      • When \c is followed by an ASCII control character (0x00..0x1F, 0x7F) or a non-ASCII character (≥0x80),

        Fatal error: Character following "\c" must be printable ASCII.

    • <5.20,

      • When \c is followed by a space, an ASCII digit, one of !"#$%&'()*+,-./:;<=>{|}~ or an ASCII control character (0x00..0x1F, 0x7F),

        chr(ord($char) ^ 0x40)

      • When \c is followed by a character ≥0x100,

        Total garbage (chr(ord(substr(encode_utf8($char), 0, 1)) ^ 0x40) . substr(encode_utf8($char), 1)).

      • When \c is followed by a character in 0x80..0xFF,

        Depending on the internal storage format of the string, produces either chr(ord($char) ^ 0x40) or the same total garbage as for characters ≥0x100.
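
For comparison with Java, the Perl ≥5.20 rules above can be modelled roughly as follows. This is only an illustrative sketch in Java, not Perl's actual implementation: the names PerlControlEscape, isWellDefined and resolve are invented here, and treating the backtick (0x60, which the lists above do not mention) as part of the warning case is an assumption.

```java
final class PerlControlEscape {
    // True for the characters where \c is well defined:
    // @, A-Z, [ \ ] ^ _ (0x40..0x5F), '?', and a-z.
    static boolean isWellDefined(int x) {
        return (x >= '@' && x <= '_') || x == '?' || (x >= 'a' && x <= 'z');
    }

    // Code point produced by \c followed by x under the >=5.20 rules described above.
    // Throws for control and non-ASCII characters, mirroring Perl's fatal error.
    static int resolve(int x) {
        if (x < 0x20 || x >= 0x7F) {
            throw new IllegalArgumentException(
                    "Character following \"\\c\" must be printable ASCII");
        }
        if (x >= 'a' && x <= 'z') {
            return x ^ 0x60; // lowercase letters behave like their uppercase forms
        }
        // Well-defined characters and the warning case (space, digits, punctuation;
        // backtick included by assumption) both XOR with 0x40. Only the warning
        // differs, and callers can detect that case with isWellDefined.
        return x ^ 0x40;
    }

    public static void main(String[] args) {
        for (int x : new int[] {'@', 'A', 'a', '[', '?', '!', '1'}) {
            System.out.printf("\\c%c -> 0x%02X%s%n", (char) x, resolve(x),
                    isWellDefined(x) ? "" : " (Perl would warn)");
        }
    }
}
```

The pre-5.20 behaviour for non-ASCII input is deliberately left out, since it depends on the string's internal storage format.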