We have byte-sequence input and need to check whether it is UTF-8 (or plain ASCII) or something else. In other words, we have to reject input encoded as ISO-8859-X (Latin-X) or any other non-UTF-8 encoding.
Our first choice was Tika, but we have a problem with it: plain ASCII input (input with no accented characters at all) is often detected as ISO-8859-2 or ISO-8859-1!
This is the problematic part:
import org.apache.tika.parser.txt.CharsetDetector;

// Every string here is pure ASCII, so it is trivially valid UTF-8,
// yet detect() reports an ISO-8859 charset as the best match for most of them.
CharsetDetector detector = new CharsetDetector();

String ascii = "Only ascii Visible:a;Invisible:GUID\nX;XXddd\n";
detector.setText(ascii.getBytes());
System.out.println("detected charset: " + detector.detect().getName());

String ascii2 = "Only ascii plain english text";
detector.setText(ascii2.getBytes());
System.out.println("detected charset: " + detector.detect().getName());

String ascii3 = "this is ISO-8859-2 do not know why";
detector.setText(ascii3.getBytes());
System.out.println("detected charset: " + detector.detect().getName());

String ascii4 = "this is UTF-8 but tell me why o why maybe sdlkfjlksdjlkfjlksdjflkjlskdjflkjsdjkflkdsjlkfjldsjlkfjldkjkfljdlkjsdfhjshdkjfhjksdhjfkksdfksjdfhkjsdhj";
detector.setText(ascii4.getBytes());
System.out.println("detected charset: " + detector.detect().getName());
This is the output:
detected charset: ISO-8859-2
detected charset: ISO-8859-1
detected charset: ISO-8859-2
detected charset: UTF-8
How should I use Tika to get sensible results?
PS: here is a mini demo: https://github.com/riskop/tikaproblem
It turns out the detector has a detectAll() method, which returns all the encodings Tika considers a match for the input. I can solve my problem with the following rule: if UTF-8 is among the matching encodings, the input is accepted (because it is possibly UTF-8); otherwise it is rejected as not UTF-8.
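A minimal sketch of that rule (the method name isPossiblyUtf8 is just illustrative, it is not in the demo):

import org.apache.tika.parser.txt.CharsetDetector;
import org.apache.tika.parser.txt.CharsetMatch;

// Accept the input only if UTF-8 is among the charsets Tika considers a match.
static boolean isPossiblyUtf8(byte[] bytes) {
    CharsetDetector detector = new CharsetDetector();
    detector.setText(bytes);
    for (CharsetMatch match : detector.detectAll()) {
        if ("UTF-8".equals(match.getName())) {
            return true;
        }
    }
    return false;
}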
I understand that Tika must use heuristics, and that there are inputs which are valid UTF-8 and valid text in some other encoding at the same time.
So, for example,
bytes = "Only ascii plain english text".getBytes("UTF-8");
printCharsetArray(new CharsetDetector().setText(bytes).detectAll());
results in:
Match of ISO-8859-1 in nl with confidence 40
Match of ISO-8859-2 in ro with confidence 30
Match of UTF-8 with confidence 15
Match of ISO-8859-9 in tr with confidence 10
Match of Big5 in zh with confidence 10
Match of EUC-KR in ko with confidence 10
Match of EUC-JP in ja with confidence 10
Match of GB18030 in zh with confidence 10
Match of Shift_JIS in ja with confidence 10
Match of UTF-16LE with confidence 10
Match of UTF-16BE with confidence 10
This is usable in my case: although the two "best" matches are ISO-8859-1 and ISO-8859-2, the third best is UTF-8, so I can accept the input.
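(For reference, printCharsetArray in these snippets is just a loop over the returned CharsetMatch array; a minimal sketch producing the output format above might be:)

import org.apache.tika.parser.txt.CharsetMatch;

// Print each candidate charset with its detected language (if any) and confidence.
static void printCharsetArray(CharsetMatch[] matches) {
    for (CharsetMatch m : matches) {
        String lang = m.getLanguage();
        System.out.println("Match of " + m.getName()
                + (lang == null || lang.isEmpty() ? "" : " in " + lang)
                + " with confidence " + m.getConfidence());
    }
}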
It also seems to work for invalid UTF-8 input, for example the byte sequence 0xC3, 0xA9, 0xA9:
bytes = new byte[]{(byte)0xC3, (byte)0xA9, (byte)0xA9}; // illegal UTF-8: the 0xC3 lead byte takes only one continuation byte, so the trailing 0xA9 is an orphan continuation byte
printCharsetArray(new CharsetDetector().setText(bytes).detectAll());
results in:
Match of Big5 in zh with confidence 10
Match of EUC-KR in ko with confidence 10
Match of EUC-JP in ja with confidence 10
Match of GB18030 in zh with confidence 10
Which is good: there is no UTF-8 among the matches.
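As a cross-check outside Tika (just to confirm that these bytes really are malformed UTF-8), a strict java.nio decoder rejects the same sequence:

import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Returns true only if the bytes decode as well-formed UTF-8.
static boolean isValidUtf8(byte[] bytes) {
    try {
        StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(bytes));
        return true;
    } catch (CharacterCodingException e) {
        return false;
    }
}

// isValidUtf8(new byte[]{(byte)0xC3, (byte)0xA9, (byte)0xA9}) returns false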
A more realistic input is text with accented characters in a non-UTF-8 encoding:
bytes = "this is somethingó not utf8 é".getBytes("ISO-8859-2");
printCharsetArray(new CharsetDetector().setText(bytes).detectAll());
results in:
Match of ISO-8859-2 in hu with confidence 31
Match of ISO-8859-1 in en with confidence 31
Match of KOI8-R in ru with confidence 10
Match of UTF-16LE with confidence 10
Match of UTF-16BE with confidence 10
Which is also good, because there is no UTF-8 among the results.
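So, applying the isPossiblyUtf8 rule sketched above to the three sample inputs (hypothetical usage, exception handling omitted):

System.out.println(isPossiblyUtf8("Only ascii plain english text".getBytes("UTF-8")));      // true  - UTF-8 is among the matches
System.out.println(isPossiblyUtf8(new byte[]{(byte)0xC3, (byte)0xA9, (byte)0xA9}));         // false - no UTF-8 match
System.out.println(isPossiblyUtf8("this is somethingó not utf8 é".getBytes("ISO-8859-2"))); // false - no UTF-8 match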