We have byte-sequence input and need to check whether it is UTF-8 (or plain ASCII) or something else. In other words, we have to reject input encoded as ISO-8859-X (Latin-X) or any other non-UTF-8 encoding.
Our first choice was Tika, but we have a problem with it: plain ASCII input (input with no accented characters at all) is often detected as ISO-8859-2 or ISO-8859-1!
This is the problematic part:
import org.apache.tika.parser.txt.CharsetDetector;

// Every string here is pure ASCII, so it is trivially valid UTF-8,
// yet detect() reports an ISO-8859 charset as the best match for most of them.
CharsetDetector detector = new CharsetDetector();

String ascii = "Only ascii Visible:a;Invisible:GUID\nX;XXddd\n";
detector.setText(ascii.getBytes());
System.out.println("detected charset: " + detector.detect().getName());

String ascii2 = "Only ascii plain english text";
detector.setText(ascii2.getBytes());
System.out.println("detected charset: " + detector.detect().getName());

String ascii3 = "this is ISO-8859-2 do not know why";
detector.setText(ascii3.getBytes());
System.out.println("detected charset: " + detector.detect().getName());

String ascii4 = "this is UTF-8 but tell me why o why maybe sdlkfjlksdjlkfjlksdjflkjlskdjflkjsdjkflkdsjlkfjldsjlkfjldkjkfljdlkjsdfhjshdkjfhjksdhjfkksdfksjdfhkjsdhj";
detector.setText(ascii4.getBytes());
System.out.println("detected charset: " + detector.detect().getName());
This is the output:
detected charset: ISO-8859-2
detected charset: ISO-8859-1
detected charset: ISO-8859-2
detected charset: UTF-8
How should I use Tika to get sensible results?
PS: here is a mini demo: https://github.com/riskop/tikaproblem
It turns out the detector has a detectAll() method, which returns all the encodings Tika considers a match for the input. I can solve my problem with the following rule: if UTF-8 is among the matching encodings, the input is accepted (because it is possibly UTF-8); otherwise it is rejected as not UTF-8.
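A minimal sketch of that rule (the method name isPossiblyUtf8 is just illustrative, it is not in the demo):

import org.apache.tika.parser.txt.CharsetDetector;
import org.apache.tika.parser.txt.CharsetMatch;

// Accept the input only if UTF-8 is among the charsets Tika considers a match.
static boolean isPossiblyUtf8(byte[] bytes) {
    CharsetDetector detector = new CharsetDetector();
    detector.setText(bytes);
    for (CharsetMatch match : detector.detectAll()) {
        if ("UTF-8".equals(match.getName())) {
            return true;
        }
    }
    return false;
}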
I understand that Tika must use heuristics, and that there are inputs which are valid UTF-8 and valid text in some other encoding at the same time.
So, for example,
bytes = "Only ascii plain english text".getBytes("UTF-8");
printCharsetArray(new CharsetDetector().setText(bytes).detectAll());
results in:
Match of ISO-8859-1 in nl with confidence 40
Match of ISO-8859-2 in ro with confidence 30
Match of UTF-8 with confidence 15
Match of ISO-8859-9 in tr with confidence 10
Match of Big5 in zh with confidence 10
Match of EUC-KR in ko with confidence 10
Match of EUC-JP in ja with confidence 10
Match of GB18030 in zh with confidence 10
Match of Shift_JIS in ja with confidence 10
Match of UTF-16LE with confidence 10
Match of UTF-16BE with confidence 10
This is usable in my case: although the two "best" matches are ISO-8859-1 and ISO-8859-2, the third best is UTF-8, so I can accept the input.
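(For reference, printCharsetArray in these snippets is just a loop over the returned CharsetMatch array; a minimal sketch producing the output format above might be:)

import org.apache.tika.parser.txt.CharsetMatch;

// Print each candidate charset with its detected language (if any) and confidence.
static void printCharsetArray(CharsetMatch[] matches) {
    for (CharsetMatch m : matches) {
        String lang = m.getLanguage();
        System.out.println("Match of " + m.getName()
                + (lang == null || lang.isEmpty() ? "" : " in " + lang)
                + " with confidence " + m.getConfidence());
    }
}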
It also seems to work for invalid UTF-8 input, for example the byte sequence 0xC3, 0xA9, 0xA9:
bytes = new byte[]{(byte)0xC3, (byte)0xA9, (byte)0xA9}; // illegal UTF-8: the 0xC3 lead byte takes only one continuation byte, so the trailing 0xA9 is an orphan continuation byte
printCharsetArray(new CharsetDetector().setText(bytes).detectAll());
results in:
Match of Big5 in zh with confidence 10
Match of EUC-KR in ko with confidence 10
Match of EUC-JP in ja with confidence 10
Match of GB18030 in zh with confidence 10
Which is good: there is no UTF-8 among the matches.
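As a cross-check outside Tika (just to confirm that these bytes really are malformed UTF-8), a strict java.nio decoder rejects the same sequence:

import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Returns true only if the bytes decode as well-formed UTF-8.
static boolean isValidUtf8(byte[] bytes) {
    try {
        StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(bytes));
        return true;
    } catch (CharacterCodingException e) {
        return false;
    }
}

// isValidUtf8(new byte[]{(byte)0xC3, (byte)0xA9, (byte)0xA9}) returns false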
A more realistic input is text with accented characters in a non-UTF-8 encoding:
bytes = "this is somethingó not utf8 é".getBytes("ISO-8859-2");
printCharsetArray(new CharsetDetector().setText(bytes).detectAll());
results in:
Match of ISO-8859-2 in hu with confidence 31
Match of ISO-8859-1 in en with confidence 31
Match of KOI8-R in ru with confidence 10
Match of UTF-16LE with confidence 10
Match of UTF-16BE with confidence 10
Which is also good, because there is no UTF-8 among the results.
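So, applying the isPossiblyUtf8 rule sketched above to the three sample inputs (hypothetical usage, exception handling omitted):

System.out.println(isPossiblyUtf8("Only ascii plain english text".getBytes("UTF-8")));      // true  - UTF-8 is among the matches
System.out.println(isPossiblyUtf8(new byte[]{(byte)0xC3, (byte)0xA9, (byte)0xA9}));         // false - no UTF-8 match
System.out.println(isPossiblyUtf8("this is somethingó not utf8 é".getBytes("ISO-8859-2"))); // false - no UTF-8 match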