
Regular expression to remove ASCII control characters in Java


I've read that the following pattern, when passed to String#replaceAll() in Java,

"[\\p{Cntrl}&&[^\r\n\t]]"

removes various non-printable ASCII characters from a string.

How does one interpret the above incantation:

  • which characters are included as part of those control chars to be removed?
  • what does the && stand for?
  • does ^ mean it only looks at the beginning of the line?

Can someone please provide a comprehensive non-technical explanation of the above expression?

Thank you in advance.


Solution

  • "... which characters are included as part of those control chars to be removed? ..."

    You can find this information in the Pattern class JavaDoc.

    Pattern – POSIX character classes (US-ASCII only) – (Java SE 20 & JDK 20).

    \p{Cntrl}    A control character: [\x00-\x1F\x7F]

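    That is, the characters with code points 0x00 through 0x1F, plus 0x7F (DEL). Because the pattern intersects this class with [^\r\n\t], carriage return, line feed, and tab are excluded from the match, so replaceAll() leaves them in place.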
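
    As a rough check of that range, the sketch below tests a handful of characters against \p{Cntrl}. The class name CntrlClassDemo and the helper isControl are made up for illustration; only java.util.regex.Pattern comes from the JDK.

    ```java
    import java.util.regex.Pattern;

    // Hypothetical demo class, not part of any library.
    class CntrlClassDemo {
        // POSIX control-character class; ASCII-only unless UNICODE_CHARACTER_CLASS is set.
        static final Pattern CNTRL = Pattern.compile("\\p{Cntrl}");

        // Returns true if the single character c is matched by \p{Cntrl}.
        static boolean isControl(char c) {
            return CNTRL.matcher(String.valueOf(c)).matches();
        }

        public static void main(String[] args) {
            System.out.println(isControl((char) 0x00)); // NUL   -> true
            System.out.println(isControl('\t'));        // TAB   -> true
            System.out.println(isControl((char) 0x1F)); // US    -> true
            System.out.println(isControl(' '));         // space -> false (0x20)
            System.out.println(isControl('A'));         // 'A'   -> false
            System.out.println(isControl((char) 0x7F)); // DEL   -> true
        }
    }
    ```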

    "... what does the && stand for? ..."

    The && operator denotes character class intersection: the resulting class matches only characters that belong to both operands.

    For example, the following matches any character from a through z except x and y.

    [a-z&&[^xy]]
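
    To see the intersection at work, here is a small sketch that removes every character matched by [a-z&&[^xy]] from a string. IntersectionDemo and stripMatches are hypothetical names chosen for this example.

    ```java
    // Hypothetical demo class, not part of any library.
    class IntersectionDemo {
        // Removes lowercase letters a through z, except x and y, which survive.
        static String stripMatches(String input) {
            return input.replaceAll("[a-z&&[^xy]]", "");
        }

        public static void main(String[] args) {
            System.out.println(stripMatches("xylophone"));   // prints "xy"
            System.out.println(stripMatches("Hello, xyz!")); // prints "H, xy!"
        }
    }
    ```

    Uppercase letters and punctuation are untouched because they fall outside a-z, and x and y are untouched because the negated inner class excludes them.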
    

    "... does ^ mean it only looks at the beginning of the line? ..."

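    Not inside a character class. When ^ is the first character after the opening [, it negates the class, so [^\r\n\t] matches any character other than carriage return, line feed, or tab. Only outside a character class does ^ anchor the match to the beginning of the input (or of a line, in multiline mode).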
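
    Putting the pieces together, the following is a minimal sketch of the pattern from the question applied to a sample string. The class ControlCharStripper, the method strip, and the sample input are assumptions made for illustration.

    ```java
    import java.util.regex.Pattern;

    // Hypothetical helper class, not part of any library.
    class ControlCharStripper {
        // Control characters, intersected with "anything except CR, LF, and TAB".
        static final Pattern UNWANTED = Pattern.compile("[\\p{Cntrl}&&[^\r\n\t]]");

        // Deletes every ASCII control character except CR, LF, and TAB.
        static String strip(String input) {
            return UNWANTED.matcher(input).replaceAll("");
        }

        public static void main(String[] args) {
            // Sample input containing BEL (0x07), TAB, CR, LF, and DEL (0x7F).
            String dirty = "a\u0007b\tc\r\nd\u007Fe";
            String clean = strip(dirty);
            System.out.println(clean.equals("ab\tc\r\nde")); // prints "true"
        }
    }
    ```

    Compiling the pattern once and reusing it gives the same result as calling String#replaceAll() with the same pattern string, without recompiling the expression on every call.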