
How to check whether XML data is valid UTF-8 and detect invalid characters?


In my application I have to validate XML data and pick up all invalid characters (and put them in CDATA).

My question is quite simple... ^^ how to do it?

I started with the Character.UnicodeBlock methods, but how does that work for characters encoded as several bytes, for example 'ï' or 'é'?

This is my code at the moment (for testing):

import java.io.UnsupportedEncodingException;
import java.lang.Character.UnicodeBlock;

// Class name chosen only so the snippet compiles on its own
public class Utf8Test {

    public static void main(String[] args) {
        try {
            byte[] data = "J'ai prïé et `".getBytes("UTF-8");

            // Prints each UTF-8 byte cast to a char, not each decoded character
            System.out.print("Data: ");
            for (int i = 0; i < data.length; i++) {
                System.out.print((char) data[i]);
            }
            System.out.println("");

            UnicodeBlock myBlock = null;

            for (int i = 0; i < data.length; i++) {
                // Each signed byte is widened to an int and treated as a code point
                int value = data[i];
                System.out.println("[" + i + " => '" + (char) data[i]
                        + "'] Is defined: " + Character.isDefined(value));
                try {
                    myBlock = Character.UnicodeBlock.of(value);
                } catch (IllegalArgumentException e) {
                    System.out.println("Count => " + Character.charCount(value));
                }
            }
        } catch (UnsupportedEncodingException e) {
            System.err.println("Unsupported encoding: " + e.getMessage());
        }
        System.out.println("Finished");
    }
}

And this is what I get when I run it:

Data: J'ai pr???? et `
[0 => 'J'] Is defined: true
[1 => '''] Is defined: true
[2 => 'a'] Is defined: true
[3 => 'i'] Is defined: true
[4 => ' '] Is defined: true
[5 => 'p'] Is defined: true
[6 => 'r'] Is defined: true
[7 => '?'] Is defined: false
Count => 1
[8 => '?'] Is defined: false
Count => 1
[9 => '?'] Is defined: false
Count => 1
[10 => '?'] Is defined: false
Count => 1
[11 => ' '] Is defined: true
[12 => 'e'] Is defined: true
[13 => 't'] Is defined: true
[14 => ' '] Is defined: true
[15 => '`'] Is defined: true
Finished

I'm trying to find a way to handle multi-byte characters as well, so that only genuinely invalid characters give a 'false' result.

Maybe a Java library already exists to do that?

It would be very kind if someone could help me. Thanks in advance.

Regards.


Solution

  • A few things:

    • CDATA will not protect you from invalid characters; junk bytes inside a CDATA section are still illegal UTF-8 sequences and may be rejected by XML parsers
    • use a CharsetDecoder configured to report errors, together with an InputStreamReader, to validate character sequences; alternatively, check that byte sequences are well-formed as described in RFC 3629, which obsoletes RFC 2279 (see the UTF-8 definition)
    • I wouldn't try parsing XML without an XML parser
    • Character.isDefined expects a UTF-16 code unit (a char) or a Unicode code point (an int), not UTF-8 encoded bytes
    • in Java 6, Character.isDefined is limited to code points defined in version 4.0 of the Unicode Standard; valid UTF-8 documents may contain code points assigned by later versions (version 6 is out now), and for those it will return false; the latest list of assigned code points is in UnicodeData.txt
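
The decoder-based check and the code point check from the list above could be sketched roughly as follows. This is only an illustration, not a complete XML validator: the class and method names (Utf8Validation, decodeStrict, isValidUtf8, countUndefinedCodePoints) are made up for the example, and it assumes Java 7 or later for StandardCharsets (on Java 6, Charset.forName("UTF-8") should work in its place).

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class; the names are illustrative only.
final class Utf8Validation {

    // Decodes the bytes as UTF-8 and throws instead of silently
    // substituting U+FFFD when a malformed sequence is found.
    static String decodeStrict(byte[] data) throws CharacterCodingException {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        return decoder.decode(ByteBuffer.wrap(data)).toString();
    }

    // True when the whole array is well-formed UTF-8.
    static boolean isValidUtf8(byte[] data) {
        try {
            decodeStrict(data);
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    // Walks the decoded text by code point (not by byte or by char) and
    // counts code points the running JVM's Unicode tables do not define.
    static int countUndefinedCodePoints(String text) {
        int undefined = 0;
        for (int i = 0; i < text.length(); ) {
            int codePoint = text.codePointAt(i);
            if (!Character.isDefined(codePoint)) {
                undefined++;
            }
            i += Character.charCount(codePoint);
        }
        return undefined;
    }

    public static void main(String[] args) throws CharacterCodingException {
        // Unicode escapes avoid depending on the source file encoding.
        byte[] good = "J'ai pr\u00EF\u00E9 et `".getBytes(StandardCharsets.UTF_8);
        // 0xC3 opens a two-byte sequence, but 'x' is not a continuation byte.
        byte[] bad = {'p', 'r', (byte) 0xC3, 'x'};

        System.out.println("good is valid UTF-8: " + isValidUtf8(good));
        System.out.println("bad is valid UTF-8: " + isValidUtf8(bad));
        System.out.println("undefined code points in good: "
                + countUndefinedCodePoints(decodeStrict(good)));
    }
}
```

For stream input, the same configured decoder can be passed to new InputStreamReader(inputStream, decoder), so that reading fails on malformed bytes rather than replacing them. Well-formed UTF-8 is still not necessarily legal XML, so the decoded text should go through a real XML parser afterwards.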