Java regular expression \cx (control characters)

The Javadoc for java.util.regex.Pattern says \cx represents "The control character corresponding to x". So I expected Pattern.compile() to reject a \c followed by any character outside [@-_], but it doesn't!

As @tchrist commented on one of the answers to "What is a regular expression for control characters?", the range is not checked at all. I tested a couple of characters from higher blocks and from the astral planes, and it looks like it merely flips the 7th lowest bit (0x40) of the code point value.
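The observation can be reproduced with a small probe. This is only a sketch: the class and method names are made up for illustration, and the result for characters outside [@-_] assumes a java.util.regex implementation that behaves like the one I tested. Other versions may differ.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

// Hypothetical probe class; the names are illustrative only.
class ControlEscapeProbe {
    // True if the pattern \c<cp> compiles and matches exactly the code point cp ^ 0x40.
    static boolean matchesFlipped(int cp) {
        String x = new String(Character.toChars(cp));
        String expected = new String(Character.toChars(cp ^ 0x40));
        try {
            return Pattern.compile("\\c" + x).matcher(expected).matches();
        } catch (PatternSyntaxException e) {
            return false; // the engine rejected the escape
        }
    }

    public static void main(String[] args) {
        // 'A' is inside the documented range; 'a', '1' and U+1F600 are not.
        int[] samples = { 'A', 'a', '1', 0x1F600 };
        for (int cp : samples) {
            System.out.printf("\\c + U+%04X -> flipped match: %b%n", cp, matchesFlipped(cp));
        }
    }
}
```

If the range were checked, the last three samples would be rejected at compile time instead of reporting a match.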

So is it a Javadoc bug or an implementation bug or am I misunderstanding something? Is \cx a Java-invented syntax or is it supported by other regex engines, especially Perl? How is it handled there?

Solution

All versions of Perl behave the same for the following escapes:

When \c is followed by an ASCII uppercase letter or one of @[\]^_?,

chr(ord($char) ^ 0x40)

This provides full coverage of all ASCII control characters (0x00..0x1F, 0x7F).
```
\c@ === \x00
\cA === \x01
...
\cZ === \x1A
\c[ === \x1B
\c\ === \x1C   # Sometimes \c\\ is needed.
\c] === \x1D
\c^ === \x1E
\c_ === \x1F
\c? === \x7F
```
When \c is followed by an ASCII lowercase letter,

chr(ord($char) ^ 0x60)

This makes the escape case-insensitive.
```
\ca === \cA === \x01
...
\cz === \cZ === \x1A
```
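For comparison with Java, the two rules above can be modelled outside Perl. This is a minimal sketch in Java, not Perl's actual implementation; `PerlControlEscape` and `controlChar` are hypothetical names, and the method only accepts the inputs for which the escape is well defined.

```java
// Hypothetical model of the well-defined \cX cases described above; not Perl source.
class PerlControlEscape {
    // Returns the code of the control character that \c<x> denotes.
    static int controlChar(char x) {
        if (x >= 'a' && x <= 'z') {
            return x ^ 0x60;           // lowercase letters: \ca..\cz
        }
        if ((x >= '@' && x <= '_') || x == '?') {
            return x ^ 0x40;           // @, A-Z, [ \ ] ^ _ and ?
        }
        throw new IllegalArgumentException("not a well-defined control escape: " + x);
    }

    public static void main(String[] args) {
        for (char x : new char[] { '@', 'A', 'a', 'Z', 'z', '[', '_', '?' }) {
            System.out.printf("\\c%c -> 0x%02X%n", x, controlChar(x));
        }
    }
}
```

By contrast, an engine that only flips 0x40, as the question reports for Java, would not treat \ca and \cA as the same character.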

No other sequence makes sense, but error checking was only introduced in Perl 5.20.

≥5.20,
- When \c is followed by a space, an ASCII digit or one of !"#$%&'()*+,-./:;<=>{|}~,
  
  chr(ord($char) ^ 0x40), but warns ("\c%c" is more clearly written simply as "%s").
- When \c is followed by an ASCII control character (0x00..0x1F, 0x7F) or a non-ASCII character (≥0x80),
  
  Fatal error: Character following "\c" must be printable ASCII.
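The ≥5.20 rules can be summarised as a three-way classification. The following is a sketch in Java rather than Perl source; `ControlEscapeCheck`, `Outcome` and `classify` are hypothetical names. The backtick, which the list above does not mention, is lumped in with the warning case here as an assumption.

```java
// Hypothetical classifier mirroring the Perl >= 5.20 rules listed above; not Perl source.
class ControlEscapeCheck {
    enum Outcome { OK, WARNING, FATAL }

    static Outcome classify(int cp) {
        // ASCII control characters (including 0x7F) and non-ASCII are rejected outright.
        if (cp < 0x20 || cp >= 0x7F) {
            return Outcome.FATAL;
        }
        boolean letter = (cp >= 'A' && cp <= 'Z') || (cp >= 'a' && cp <= 'z');
        boolean punct = "@[\\]^_?".indexOf(cp) >= 0;
        // Any other printable ASCII character is accepted, but with a warning.
        return (letter || punct) ? Outcome.OK : Outcome.WARNING;
    }

    public static void main(String[] args) {
        int[] samples = { 'A', 'z', '?', '1', ' ', 0x07, 0xE9 };
        for (int cp : samples) {
            System.out.printf("U+%04X -> %s%n", cp, classify(cp));
        }
    }
}
```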

<5.20,
- When \c is followed by a space, an ASCII digit, one of !"#$%&'()*+,-./:;<=>{|}~ or an ASCII control character (0x00..0x1F, 0x7F),
  
  chr(ord($char) ^ 0x40)
- When \c is followed by a character ≥0x100,
  
  Total garbage (chr(ord(substr(encode_utf8($char), 0, 1)) ^ 0x40) . substr(encode_utf8($char), 1)).
- When \c is followed by a character in 0x80..0xFF,
  
  Depending on the internal storage format of the string, produces either chr(ord($char) ^ 0x40) or the same total garbage as for characters ≥0x100.
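
To make the "total garbage" case concrete, here is a sketch of the byte arithmetic in Java (not Perl source; `Utf8FirstByteFlip`, `garbage` and `hex` are hypothetical names). It only models XORing 0x40 into the first UTF-8 byte and keeping the remaining bytes; how Perl then interprets those bytes is not modelled.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical illustration of the pre-5.20 "garbage" formula above; not Perl source.
class Utf8FirstByteFlip {
    // XOR 0x40 into the first UTF-8 byte of the code point, leaving the rest untouched.
    static byte[] garbage(int cp) {
        byte[] utf8 = new String(Character.toChars(cp)).getBytes(StandardCharsets.UTF_8);
        utf8[0] ^= 0x40;
        return utf8;
    }

    // Formats bytes as space-separated uppercase hex pairs.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X ", b & 0xFF));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        System.out.println(hex(garbage(0x263A)));
        System.out.println(hex(garbage(0x00E9)));
    }
}
```

For U+263A, whose UTF-8 encoding is E2 98 BA, this yields A2 98 BA; for U+00E9 (C3 A9) it yields 83 A9. Neither result is a control character.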