In a large data set I have some data that looks like this:
"guide (but, yeah, it’s okay to share it with ‘em)."
I've opened the file in a hex editor and run the raw byte data through a character encoding detection algorithm (http://code.google.com/p/juniversalchardet/) and it's positively detected as UTF-8.
It appears to me that the source of the data mis-interpreted the original character set and wrote valid UTF-8 as the output that I have received.
I'd like to validate the data as best I can. Are there any heuristics or algorithms out there that might help me take a stab at validation?
You cannot do that once you have the string; you have to do it while you still have the raw input. Once you have the string, there is no way to automatically tell whether ’ was actually the intended input without some seriously fragile tests. For example:
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CharsetEncoder;

public static boolean isUTF8MisInterpreted(String input) {
    // Convenience overload for the most common UTF-8 misinterpretation,
    // which is also the case in your question
    return isUTF8MisInterpreted(input, "Windows-1252");
}

public static boolean isUTF8MisInterpreted(String input, String encoding) {
    CharsetDecoder decoder = Charset.forName("UTF-8").newDecoder();
    CharsetEncoder encoder = Charset.forName(encoding).newEncoder();
    ByteBuffer tmp;
    try {
        // Undo the suspected wrong decoding: turn the chars back into bytes
        tmp = encoder.encode(CharBuffer.wrap(input));
    } catch (CharacterCodingException e) {
        // A character does not exist in that encoding, so it was not the culprit
        return false;
    }
    try {
        // If those bytes are valid UTF-8, the input was probably misinterpreted
        decoder.decode(tmp);
        return true;
    } catch (CharacterCodingException e) {
        return false;
    }
}

public static void main(String[] args) {
    String test = "guide (but, yeah, it’s okay to share it with ‘em).";
    String test2 = "guide (but, yeah, it’s okay to share it with ‘em).";
    System.out.println(isUTF8MisInterpreted(test));  // true
    System.out.println(isUTF8MisInterpreted(test2)); // false
}
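One reason the test above is fragile: a pure ASCII string also encodes to bytes that are valid UTF-8, so it returns true for text that was never damaged. A slightly stricter variant, sketched below, only reports a hit when the round trip actually changes the string, and hands back the repaired text. The class name `MojibakeRepair` and the method `tryRepair` are made up for illustration, and Windows-1252 is assumed as the wrong encoding, as in the question.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Optional;

public class MojibakeRepair {
    // Hypothetical helper: returns the repaired text only if the input
    // encodes cleanly as windows-1252, those bytes decode as valid UTF-8,
    // and the result differs from the input (which rules out plain ASCII).
    static Optional<String> tryRepair(String input) {
        try {
            ByteBuffer bytes = Charset.forName("windows-1252")
                    .newEncoder()
                    .encode(CharBuffer.wrap(input));
            String decoded = StandardCharsets.UTF_8
                    .newDecoder()
                    .decode(bytes)
                    .toString();
            return decoded.equals(input) ? Optional.empty() : Optional.of(decoded);
        } catch (CharacterCodingException e) {
            // Either step failed, so this was not UTF-8 read as windows-1252
            return Optional.empty();
        }
    }

    public static void main(String[] args) {
        // "it’s" written with escapes so the source file encoding does not matter
        String garbled = "it\u00e2\u20ac\u2122s";
        String correct = "it\u2019s";
        System.out.println(tryRepair(garbled).isPresent());     // true
        System.out.println(tryRepair(correct).isPresent());     // false
        System.out.println(tryRepair("plain text").isPresent()); // false
    }
}
```

This is still a heuristic: a string that legitimately contains a sequence such as ’ would be "repaired" as well, so it is best used to flag candidates for review rather than to rewrite data blindly.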
If you still have access to the raw input, you can check whether a byte array consists entirely of valid UTF-8 byte sequences with this:
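import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;

public static boolean isValidUTF8(byte[] input) {
    // A fresh decoder reports malformed input instead of replacing it
    CharsetDecoder cs = Charset.forName("UTF-8").newDecoder();
    try {
        cs.decode(ByteBuffer.wrap(input));
        return true;
    } catch (CharacterCodingException e) {
        return false;
    }
}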
You can also use a CharsetDecoder with streams. A decoder obtained from newDecoder() reports errors by default, so a reader built on it throws an exception as soon as it sees bytes that are invalid in the given encoding.
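As a rough sketch of the stream variant: passing a decoder (rather than a charset name) to `InputStreamReader` keeps the strict behaviour, whereas the constructors that take a name or a `Charset` silently substitute replacement characters. The class name `StrictUtf8Stream`, the method name, and the buffer size are assumptions made for the example.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.StandardCharsets;

public class StrictUtf8Stream {
    // Reads the whole stream through a strict UTF-8 decoder.
    // Returns false at the first malformed byte sequence.
    static boolean isValidUtf8(InputStream in) {
        try (Reader reader =
                 new InputStreamReader(in, StandardCharsets.UTF_8.newDecoder())) {
            char[] buf = new char[8192];
            while (reader.read(buf) != -1) {
                // Discard the characters; only decoding errors matter here
            }
            return true;
        } catch (CharacterCodingException e) {
            // Thrown by the reader when the bytes are not valid UTF-8
            return false;
        } catch (IOException e) {
            // Genuine I/O failure, unrelated to the encoding
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] good = "it\u2019s".getBytes(StandardCharsets.UTF_8);
        // 0x92 is the windows-1252 right single quote; alone it is not valid UTF-8
        byte[] bad = {0x69, 0x74, (byte) 0x92, 0x73};
        System.out.println(isValidUtf8(new ByteArrayInputStream(good))); // true
        System.out.println(isValidUtf8(new ByteArrayInputStream(bad)));  // false
    }
}
```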