In my application I have to validate XML data and pick out all invalid characters (and put them in CDATA).
My question is quite simple: how do I do it?
I started with the Character.UnicodeBlock methods, but how does this work for characters encoded as several bytes, for example 'ï' or 'é'?
This is my code at the moment (for testing):
public static void main(String[] args) {
    try {
        byte[] data = "J'ai prïé et `".getBytes("UTF-8");
        System.out.print("Data: ");
        for (int i = 0; i < data.length; i++) {
            System.out.print((char) data[i]);
        }
        System.out.println("");
        UnicodeBlock myBlock = null;
        for (int i = 0; i < data.length; i++) {
            System.out.println("[" + i + " => '" + (char) data[i]
                    + "'] Is defined: "
                    + Character.isDefined(new Byte(data[i]).intValue()));
            try {
                myBlock = Character.UnicodeBlock.of(new Byte(data[i])
                        .intValue());
            } catch (IllegalArgumentException e) {
                System.out
                        .println("Count => "
                                + Character.charCount(new Byte(data[i])
                                        .intValue()));
            }
        }
    } catch (UnsupportedEncodingException e) {
        System.err.println("Unsupported encoding: " + e.getMessage());
    }
    System.out.println("Finished");
}
And this is what I get when I run it:
Data: J'ai pr???? et `
[0 => 'J'] Is defined: true
[1 => '''] Is defined: true
[2 => 'a'] Is defined: true
[3 => 'i'] Is defined: true
[4 => ' '] Is defined: true
[5 => 'p'] Is defined: true
[6 => 'r'] Is defined: true
[7 => '?'] Is defined: false
Count => 1
[8 => '?'] Is defined: false
Count => 1
[9 => '?'] Is defined: false
Count => 1
[10 => '?'] Is defined: false
Count => 1
[11 => ' '] Is defined: true
[12 => 'e'] Is defined: true
[13 => 't'] Is defined: true
[14 => ' '] Is defined: true
[15 => '`'] Is defined: true
Finished
I'm trying to find a way to also detect multi-byte characters, and to get 'false' only for characters that are really invalid.
Maybe a Java library already exists to do that?
It would be very kind if someone could help me. Thanks in advance.
Regards.
A few things:
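- Character.isDefined expects a UTF-16BE encoded char (or a UTF-32BE encoded int), not UTF-8 encoded bytes. Each byte of a multi-byte UTF-8 sequence becomes a negative value when widened to an int, which is why those positions report false and UnicodeBlock.of throws IllegalArgumentException. Decode the bytes to a String first.
- Character.isDefined is limited to code points defined in the Unicode Standard, version 4.0; there may be valid UTF-8 documents that use code points defined by later versions, for which this will fail (version 6 is out now). The latest list of valid code points is defined in UnicodeData.txt.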
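To illustrate the first point, here is a minimal sketch (not a definitive implementation) that decodes the UTF-8 bytes strictly and then walks the result one code point at a time instead of one byte at a time. The class name XmlCharCheck and the helpers decodeUtf8 and isXmlChar are hypothetical names chosen for this example. The ranges in isXmlChar follow the Char production of XML 1.0, which I am assuming is the definition of "valid" wanted here. StandardCharsets requires Java 7 or later.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class XmlCharCheck {

    // Hypothetical helper: strict UTF-8 decoding. Malformed byte sequences
    // raise CharacterCodingException instead of being silently replaced.
    public static String decodeUtf8(byte[] data) throws CharacterCodingException {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        return decoder.decode(ByteBuffer.wrap(data)).toString();
    }

    // Hypothetical helper: true if the code point is allowed by the
    // XML 1.0 "Char" production (assumed to be the rule that matters here).
    public static boolean isXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    public static void main(String[] args) throws CharacterCodingException {
        // \u00EF is 'ï' and \u00E9 is 'é'; escapes avoid source-encoding issues.
        byte[] data = "J'ai pr\u00EF\u00E9 et `".getBytes(StandardCharsets.UTF_8);
        String text = decodeUtf8(data);

        // Step by code point, not by byte or by char, so that characters
        // outside the Basic Multilingual Plane are handled as one unit.
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            System.out.println("[" + i + " => U+"
                    + Integer.toHexString(cp).toUpperCase()
                    + "] defined: " + Character.isDefined(cp)
                    + ", XML char: " + isXmlChar(cp));
            i += Character.charCount(cp);
        }
    }
}
```

With this approach 'ï' and 'é' are each seen as a single code point, so only genuinely malformed input (reported by the decoder) or code points outside the XML ranges are flagged.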