The Javadoc for java.util.regex.Pattern says \cx represents "The control character corresponding to x". So I thought Pattern.compile() would reject a \c followed by any character outside [@-_], but it doesn't!
As @tchrist commented on one of the answers to What is a regular expression for control characters?, the range is not checked at all. I tested a few characters from higher blocks and from the astral planes, and it looks like it merely flips the 7th lowest bit (0x40) of the code point value.
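A minimal sketch of how this can be probed, assuming the "flip bit 0x40" model described above. The class and method names (ControlEscapeProbe, flipped, matchesModel) are made up for illustration, and the sample characters are arbitrary; the program only reports whether each sample agrees with the model on the JDK it runs on.

```java
import java.util.regex.Pattern;

public class ControlEscapeProbe {
    // Assumed model from the observation above: flip bit 0x40 of the code point.
    static int flipped(int codePoint) {
        return codePoint ^ 0x40;
    }

    // True if the regex \c<x> matches exactly the character the model predicts.
    static boolean matchesModel(int codePoint) {
        String x = new String(Character.toChars(codePoint));
        String predicted = new String(Character.toChars(flipped(codePoint)));
        return Pattern.compile("\\c" + x).matcher(predicted).matches();
    }

    public static void main(String[] args) {
        // ASCII letters and punctuation, a digit, and a few non-ASCII samples.
        int[] samples = {'A', 'Z', '?', 'a', '1', 0x00E9, 0x4E2D, 0x1F600};
        for (int cp : samples) {
            System.out.printf("\\c + U+%04X -> predicted U+%04X, agrees: %b%n",
                    cp, flipped(cp), matchesModel(cp));
        }
    }
}
```

If every line reports agreement, the engine is not validating the character after \c at all, which is what the comment quoted above suggests.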
So is it a Javadoc bug or an implementation bug, or am I misunderstanding something? Is \cx a Java-invented syntax, or is it supported by other regex engines, especially Perl? How is it handled there?
All versions of Perl behave the same for the following escapes:
When \c is followed by an ASCII uppercase letter or one of @[\]^_?, the result is chr(ord($char) ^ 0x40).
This provides full coverage of all ASCII control characters (0x00..0x1F, 0x7F).
\c@ === \x00
\cA === \x01
...
\cZ === \x1A
\c[ === \x1B
\c\ === \x1C # Sometimes \c\\ is needed.
\c] === \x1D
\c^ === \x1E
\c_ === \x1F
\c? === \x7F
When \c is followed by an ASCII lowercase letter, the result is chr(ord($char) ^ 0x60).
This makes the escape case-insensitive.
\ca === \cA === \x01
...
\cz === \cZ === \x1A
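The two mappings above can be sketched in Java as follows. This is only an illustration of the arithmetic, not Perl's actual implementation; the class PerlControlEscape and the helper controlCode are hypothetical names, and returning -1 for uncovered characters is a choice made for this example.

```java
public class PerlControlEscape {
    // Hypothetical helper: the code of the character produced by \c<ch> under
    // the two rules above, or -1 if neither rule covers ch.
    static int controlCode(char ch) {
        if (ch >= '?' && ch <= '_') {
            // 0x3F..0x5F is exactly ?, @, A-Z, [, \, ], ^ and _
            return ch ^ 0x40;
        }
        if (ch >= 'a' && ch <= 'z') {
            // Lowercase letters fold onto the same codes as uppercase.
            return ch ^ 0x60;
        }
        return -1;
    }

    public static void main(String[] args) {
        System.out.printf("\\c@ -> 0x%02X%n", controlCode('@')); // 0x00
        System.out.printf("\\cA -> 0x%02X%n", controlCode('A')); // 0x01
        System.out.printf("\\ca -> 0x%02X%n", controlCode('a')); // 0x01
        System.out.printf("\\c[ -> 0x%02X%n", controlCode('[')); // 0x1B
        System.out.printf("\\c? -> 0x%02X%n", controlCode('?')); // 0x7F
    }
}
```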
No other sequence makes sense, but error checking was only introduced in Perl 5.20.
≥5.20,
When \c is followed by a space, an ASCII digit or one of !"#$%&'()*+,-./:;<=>{|}~, the result is chr(ord($char) ^ 0x40), but Perl warns that the sequence "is more clearly written simply as" the resulting character.
When \c is followed by an ASCII control character (0x00..0x1F, 0x7F) or a non-ASCII character (≥0x80), the result is a fatal error: Character following "\c" must be printable ASCII.
<5.20,
When \c is followed by a space, an ASCII digit, one of !"#$%&'()*+,-./:;<=>{|}~ or an ASCII control character (0x00..0x1F, 0x7F), the result is chr(ord($char) ^ 0x40).
When \c is followed by a character ≥0x100, the result is total garbage: chr(ord(substr(encode_utf8($char), 0, 1)) ^ 0x40) . substr(encode_utf8($char), 1).
When \c is followed by a character in 0x80..0xFF, the result depends on the internal storage format of the string: either chr(ord($char) ^ 0x40) or the same total garbage as for characters ≥0x100.
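To make the "total garbage" case concrete, here is a rough Java sketch of the byte arithmetic, reading the expression above as: XOR the first byte of the character's UTF-8 encoding with 0x40 and keep the remaining bytes unchanged. It only shows the resulting bytes and does not attempt to reproduce how old Perl versions then interpret them. LegacyControlEscapeBytes, garbledBytes and hex are hypothetical names.

```java
import java.nio.charset.StandardCharsets;

public class LegacyControlEscapeBytes {
    // Hypothetical helper: UTF-8 encode the code point, then flip bit 0x40
    // of the first byte only, leaving the continuation bytes untouched.
    static byte[] garbledBytes(int codePoint) {
        byte[] out = new String(Character.toChars(codePoint))
                .getBytes(StandardCharsets.UTF_8);
        out[0] = (byte) (out[0] ^ 0x40);
        return out;
    }

    // Formats bytes as space-separated uppercase hex, e.g. "84 80".
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // U+0100 is C4 80 in UTF-8; flipping 0x40 in the lead byte gives 84 80.
        System.out.println(hex(garbledBytes(0x0100))); // 84 80
        // U+263A is E2 98 BA in UTF-8; the lead byte becomes A2.
        System.out.println(hex(garbledBytes(0x263A))); // A2 98 BA
    }
}
```

In both samples the lead byte turns into a continuation byte, so the output is no longer well-formed UTF-8, which fits the description of the result as garbage.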