Apache Tika and JSON


I'm using Apache Tika to determine a file's type from its content. XML files are detected correctly, but JSON files are not: for a JSON file, Tika returns "text/plain" instead of "application/json".

Any help?

public static String tiKaDetectMimeType(final File file) throws IOException {
    TikaInputStream tikaIS = null;
    try {
        tikaIS = TikaInputStream.get(file);
        final Metadata metadata = new Metadata();
        return DETECTOR.detect(tikaIS, metadata).toString();
    } finally {
        if (tikaIS != null) {
            tikaIS.close();
        }
    }
}

Solution

  • JSON is a plain-text format, so it's not altogether surprising that Tika reports it as text/plain when it has only the bytes to work with.

    Your problem is that you didn't also supply the filename, so Tika had nothing but the content to go on. If you had, Tika could have reasoned bytes=plain text + filename=json => json and given you the answer you expected.

    The line you're missing is:

    metadata.set(Metadata.RESOURCE_NAME_KEY, filename);
    

    So the fixed code snippet would be:

    tikaIS = TikaInputStream.get(file);
    final Metadata metadata = new Metadata();
    metadata.set(Metadata.RESOURCE_NAME_KEY, file.getName());
    return DETECTOR.detect(tikaIS, metadata).toString();
    

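    With that, you'll get back application/json, as you were expecting. (If you're on Tika 2.x, the constant lives in TikaCoreProperties rather than Metadata, i.e. TikaCoreProperties.RESOURCE_NAME_KEY.)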
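
To make the "bytes=plain text + filename=json => json" reasoning concrete, here is a minimal, standard-library-only sketch of how a content guess can be refined by a filename hint. It is not Tika's implementation and does not use Tika at all: the class name MimeHintSketch, its sniff and detect methods, and the single ".json" rule are all made up for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.Locale;

// Hypothetical illustration only: this is NOT how Tika is implemented.
final class MimeHintSketch {

    // Content-only guess: XML has a recognisable prologue, JSON has no such marker,
    // so anything that is not obviously XML is treated as generic text here.
    static String sniff(byte[] content) {
        int len = Math.min(content.length, 64);
        String head = new String(content, 0, len, StandardCharsets.UTF_8).trim();
        return head.startsWith("<?xml") ? "application/xml" : "text/plain";
    }

    // Refine a generic text/plain guess using the file name, if one was supplied.
    static String detect(byte[] content, String fileName) {
        String sniffed = sniff(content);
        if (fileName != null
                && sniffed.equals("text/plain")
                && fileName.toLowerCase(Locale.ROOT).endsWith(".json")) {
            return "application/json";
        }
        return sniffed;
    }

    public static void main(String[] args) {
        byte[] json = "{\"name\": \"tika\"}".getBytes(StandardCharsets.UTF_8);
        System.out.println(detect(json, null));        // no name hint: text/plain
        System.out.println(detect(json, "data.json")); // with name hint: application/json
    }
}
```

The point of the sketch is only that the name hint narrows a generic text result and leaves a more specific content match (here, XML) alone. In the real code that hint is what the RESOURCE_NAME_KEY metadata entry supplies.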