Removing invalid XML characters from a string in Java


Hi, I would like to remove all invalid XML characters from a string, using a regular expression with the string.replace method.

Something like:

line.replace(regExp,"");

What is the right regExp to use?

An invalid XML character is anything outside these ranges:

[#x1-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]

Thanks.


Solution

  • Java's regex engine supports supplementary characters, so you can specify those high ranges as UTF-16 surrogate pairs or, more simply, use the \x{h...h} escape (Java 7 and later) to name any valid code point.

    Here is the pattern for removing characters that are illegal in XML 1.0:

    // XML 1.0
    // #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    String xml10pattern = "[^"
                        + "\u0009\r\n"
                        + "\u0020-\uD7FF"
                        + "\uE000-\uFFFD"
                        // "\\x{...}" reaches the regex engine as \x{...}; a bare \x is not a valid Java string escape
                        + "\\x{10000}-\\x{10FFFF}"
                        + "]";
    

    Most people will want the XML 1.0 version.

    Here is the pattern for removing characters that are illegal in XML 1.1:

    // XML 1.1
    // [#x1-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    String xml11pattern = "[^"
                        + "\u0001-\uD7FF"
                        + "\uE000-\uFFFD"
                        // "\\x{...}" reaches the regex engine as \x{...}; a bare \x is not a valid Java string escape
                        + "\\x{10000}-\\x{10FFFF}"
                        + "]+";
    

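    You will need to use String.replaceAll(...) and not String.replace(...), because replace(...) treats its first argument as literal text rather than as a regular expression.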

    String illegal = "Hello, World!\0";  // ends with NUL (U+0000), which is illegal in XML 1.0
    String legal = illegal.replaceAll(xml10pattern, "");  // use xml11pattern for XML 1.1
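
If the same cleanup runs on many strings, it may be worth compiling each pattern once instead of letting replaceAll(...) recompile it on every call. The following is a minimal sketch along those lines, not a definitive implementation. It assumes Java 7 or later for the \x{h...h} escape, and the names XmlSanitizer, stripInvalidXml10 and stripInvalidXml11 are hypothetical ones chosen for illustration.

```java
import java.util.regex.Pattern;

// Hypothetical helper class; the name and method names are illustrative only.
final class XmlSanitizer {

    // XML 1.0 Char: #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    private static final Pattern INVALID_XML10 = Pattern.compile(
            "[^\\t\\n\\r"
            + "\\x{20}-\\x{D7FF}"
            + "\\x{E000}-\\x{FFFD}"
            + "\\x{10000}-\\x{10FFFF}"
            + "]+");

    // XML 1.1 Char: [#x1-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    private static final Pattern INVALID_XML11 = Pattern.compile(
            "[^\\x{1}-\\x{D7FF}"
            + "\\x{E000}-\\x{FFFD}"
            + "\\x{10000}-\\x{10FFFF}"
            + "]+");

    private XmlSanitizer() {
        // static helpers only
    }

    // Removes every run of characters that is not allowed in XML 1.0.
    static String stripInvalidXml10(String input) {
        return INVALID_XML10.matcher(input).replaceAll("");
    }

    // Removes every run of characters that is not allowed in XML 1.1.
    static String stripInvalidXml11(String input) {
        return INVALID_XML11.matcher(input).replaceAll("");
    }

    public static void main(String[] args) {
        // The trailing NUL is dropped, so this prints: Hello, World!
        System.out.println(stripInvalidXml10("Hello, World!\0"));
    }
}
```

Writing every range with \x{...} keeps literal surrogate characters out of the source file. Because the regex engine matches by code point, valid surrogate pairs such as emoji fall inside the [#x10000-#x10FFFF] range and are kept.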