
Tika is not detecting plain ASCII input


We have a byte-sequence input and need to check whether it is UTF-8, plain ASCII, or something else. In other words, we have to reject ISO-8859-X (Latin-X) or otherwise encoded input.

Our first choice was Tika, but we have a problem with it: plain ASCII input (input with no accented characters at all) is often detected as ISO-8859-2 or ISO-8859-1!

This is the problematic part:

    import org.apache.tika.parser.txt.CharsetDetector;

    CharsetDetector detector = new CharsetDetector();
    String ascii = "Only ascii Visible:a;Invisible:GUID\nX;XXddd\n";
    detector.setText(ascii.getBytes());
    System.out.println("detected charset: " + detector.detect().getName());
    String ascii2 = "Only ascii plain english text";
    detector.setText(ascii2.getBytes());
    System.out.println("detected charset: " + detector.detect().getName());
    String ascii3 = "this is ISO-8859-2 do not know why";
    detector.setText(ascii3.getBytes());
    System.out.println("detected charset: " + detector.detect().getName());
    String ascii4 = "this is UTF-8 but tell me why o why maybe sdlkfjlksdjlkfjlksdjflkjlskdjflkjsdjkflkdsjlkfjldsjlkfjldkjkfljdlkjsdfhjshdkjfhjksdhjfkksdfksjdfhkjsdhj";
    detector.setText(ascii4.getBytes());
    System.out.println("detected charset: " + detector.detect().getName());

This is the output:

    detected charset: ISO-8859-2
    detected charset: ISO-8859-1
    detected charset: ISO-8859-2
    detected charset: UTF-8

How should I use Tika to get sensible results?

P.S.: Here is a minimal demo: https://github.com/riskop/tikaproblem


Solution

  • The detector has a detectAll() method, which returns all the encodings Tika considers a match for the input. I can solve my problem by following this rule: if UTF-8 is among the matching encodings, the input is accepted (because it could be UTF-8); otherwise the input is rejected as not UTF-8. A sketch of this check appears after the first example below.

    I understand that Tika must use heuristics, and that there are inputs which can simultaneously be valid UTF-8 and valid text in some other encoding.
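
    For reference, the printCharsetArray helper used in the snippets below is just a loop over the CharsetMatch array; a minimal sketch (the real one lives in the linked demo repo) could look like this:

        // Hypothetical helper: prints each CharsetMatch returned by detectAll().
        // CharsetMatch is org.apache.tika.parser.txt.CharsetMatch.
        static void printCharsetArray(CharsetMatch[] matches) {
            for (CharsetMatch match : matches) {
                String language = match.getLanguage(); // null when unknown
                System.out.println("Match of " + match.getName()
                        + (language != null ? " in " + language : "")
                        + " with confidence " + match.getConfidence());
            }
        }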

    So, for example:

        bytes = "Only ascii plain english text".getBytes("UTF-8");
        printCharsetArray(new CharsetDetector().setText(bytes).detectAll());
    

    results in:

    Match of ISO-8859-1 in nl with confidence 40
    Match of ISO-8859-2 in ro with confidence 30
    Match of UTF-8 with confidence 15
    Match of ISO-8859-9 in tr with confidence 10
    Match of Big5 in zh with confidence 10
    Match of EUC-KR in ko with confidence 10
    Match of EUC-JP in ja with confidence 10
    Match of GB18030 in zh with confidence 10
    Match of Shift_JIS in ja with confidence 10
    Match of UTF-16LE with confidence 10
    Match of UTF-16BE with confidence 10
    

    This is usable in my case: although the two "best" matches are ISO-8859-1 and ISO-8859-2, the third best is UTF-8, so I can accept the input.
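
    Putting the rule together, a minimal sketch of the accept/reject check might look like this (the method name isPossiblyUtf8 is mine, not Tika's):

        // Accept the input only if UTF-8 appears anywhere among the candidates.
        static boolean isPossiblyUtf8(byte[] input) {
            CharsetMatch[] matches = new CharsetDetector().setText(input).detectAll();
            for (CharsetMatch match : matches) {
                if ("UTF-8".equals(match.getName())) {
                    return true;
                }
            }
            return false;
        }

    With the outputs shown in this answer, the example above is accepted, while the two examples below are rejected.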

    It also seems to work for invalid UTF-8 input.

    For example, the byte sequence 0xC3, 0xA9, 0xA9:

        bytes = new byte[]{(byte)0xC3, (byte)0xA9, (byte)0xA9}; // illegal UTF-8: 0xC3 starts a two-byte sequence, so only one continuation byte may follow
        printCharsetArray(new CharsetDetector().setText(bytes).detectAll());
    

    results in:

    Match of Big5 in zh with confidence 10
    Match of EUC-KR in ko with confidence 10
    Match of EUC-JP in ja with confidence 10
    Match of GB18030 in zh with confidence 10
    

    Which is good: there is no UTF-8 among the matches.
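
    As an aside, the claim that this byte sequence is malformed UTF-8 can be cross-checked with the JDK's strict decoder, independently of Tika (a freshly created CharsetDecoder reports malformed input instead of replacing it):

        byte[] bytes = new byte[]{(byte) 0xC3, (byte) 0xA9, (byte) 0xA9};
        try {
            java.nio.charset.StandardCharsets.UTF_8.newDecoder()
                    .decode(java.nio.ByteBuffer.wrap(bytes));
            System.out.println("valid UTF-8");
        } catch (java.nio.charset.CharacterCodingException e) {
            System.out.println("not valid UTF-8"); // this branch runs for this input
        }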

    A more realistic input is text with accented characters in a non-UTF-8 encoding:

        bytes = "this is somethingó not utf8 é".getBytes("ISO-8859-2");
        printCharsetArray(new CharsetDetector().setText(bytes).detectAll());
    

    results in:

    Match of ISO-8859-2 in hu with confidence 31
    Match of ISO-8859-1 in en with confidence 31
    Match of KOI8-R in ru with confidence 10
    Match of UTF-16LE with confidence 10
    Match of UTF-16BE with confidence 10
    

    Which is good, because there is no UTF-8 among the results.