java compression apache-commons apache-tika

Detecting compression type via Apache Commons Compress

Is there a quick way of reliably detecting the compression type of a file from its content (i.e., not from the file extension), using the Apache Commons Compress API?

Using Apache Tika, one can do

import java.io.FileInputStream;
import org.apache.tika.Tika;
Tika tika = new Tika();
String path = <the full path to the file examined, including the filename>;
String type;
try (FileInputStream fis = new FileInputStream(path)) { // closes the stream when done
    type = tika.detect(fis);
}

and the type variable then holds the detected MIME type of the file content (e.g., text/plain, application/zip, etc.).

Ideally, I would like to avoid involving Tika in this process for several reasons, including the fact that Tika seems to misdetect the AR archive format, which is among the ones Commons Compress can produce, as "text/plain".

Solution

Your best bet is probably to read the first few bytes of the file and check them against the MIME magic byte patterns of the formats you're interested in.

This is what Tika does for you when you ask it to detect a type. You could, however, write your own detector.
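As a rough illustration of the hand-rolled approach, here is a minimal sketch that compares the start of a file against a few well-known signatures. The class name MagicSniffer, its method names, and the MIME labels it returns are made up for this example; they are not part of Tika or Commons Compress. The signature table is deliberately short and would need extending for other formats.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper for illustration; not part of Tika or Commons Compress.
final class MagicSniffer {
    // Label -> signature expected at offset 0. The labels are illustrative.
    private static final Map<String, byte[]> SIGNATURES = new LinkedHashMap<>();
    static {
        SIGNATURES.put("application/x-archive", ascii("!<arch>\n"));            // AR
        SIGNATURES.put("application/zip", new byte[] {0x50, 0x4B, 0x03, 0x04}); // "PK\3\4"
        SIGNATURES.put("application/gzip", new byte[] {0x1F, (byte) 0x8B});
        SIGNATURES.put("application/x-bzip2", ascii("BZh"));
        SIGNATURES.put("application/x-xz",
                new byte[] {(byte) 0xFD, 0x37, 0x7A, 0x58, 0x5A, 0x00});
        SIGNATURES.put("application/x-7z-compressed",
                new byte[] {0x37, 0x7A, (byte) 0xBC, (byte) 0xAF, 0x27, 0x1C});
    }

    private static byte[] ascii(String s) {
        return s.getBytes(StandardCharsets.US_ASCII);
    }

    // True if 'magic' appears in 'data' starting at 'offset'.
    private static boolean matchesAt(byte[] data, int offset, byte[] magic) {
        if (data.length < offset + magic.length) {
            return false;
        }
        for (int i = 0; i < magic.length; i++) {
            if (data[offset + i] != magic[i]) {
                return false;
            }
        }
        return true;
    }

    // Classifies the leading bytes of a file; falls back to a generic label.
    static String detect(byte[] head) {
        for (Map.Entry<String, byte[]> e : SIGNATURES.entrySet()) {
            if (matchesAt(head, 0, e.getValue())) {
                return e.getKey();
            }
        }
        // POSIX tar keeps its "ustar" marker at offset 257, not at the start.
        if (matchesAt(head, 257, ascii("ustar"))) {
            return "application/x-tar";
        }
        return "application/octet-stream";
    }

    // Reads up to 512 bytes from the file and classifies them (Java 9+).
    static String detect(Path file) throws IOException {
        byte[] buf = new byte[512];
        try (InputStream in = Files.newInputStream(file)) {
            int n = in.readNBytes(buf, 0, buf.length);
            return detect(Arrays.copyOf(buf, n));
        }
    }
}
```

Reading 512 bytes is enough to cover the tar marker at offset 257. Formats without a fixed signature, or variants such as an empty ZIP archive that starts with a different record, would fall through to the generic label in this sketch.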

It might be possible to pass the stream to each Commons Compress decoder in turn, and assume that the first one to not blow up is the format, but that may be a bit unreliable...
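To make the "try each decoder in turn" idea concrete, here is a minimal sketch that uses the JDK's own java.util.zip decoders as stand-ins, so that it runs without extra dependencies. The class DecoderProbe, the Probe interface, and the format labels are invented for this example. The assumption is that Commons Compress stream classes could be registered as further probes in the same way, which is not shown here.

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.zip.GZIPInputStream;
import java.util.zip.ZipInputStream;

// Hypothetical helper for illustration; the probes use JDK decoders only.
final class DecoderProbe {
    // A probe returns true if its decoder accepts the start of the stream.
    interface Probe {
        boolean accepts(InputStream in) throws IOException;
    }

    // Probes are tried in insertion order.
    private static final Map<String, Probe> PROBES = new LinkedHashMap<>();
    static {
        // The decoders are intentionally not closed: closing them would
        // also close the underlying stream that is being probed.
        PROBES.put("gzip", in -> {
            new GZIPInputStream(in).read(); // throws if the header is not gzip
            return true;
        });
        PROBES.put("zip", in -> new ZipInputStream(in).getNextEntry() != null);
    }

    // Returns the label of the first probe that does not fail, or null.
    // 'lookahead' must cover everything a decoder may read while probing.
    static String firstAccepting(InputStream raw, int lookahead) throws IOException {
        InputStream in = raw.markSupported() ? raw : new BufferedInputStream(raw);
        for (Map.Entry<String, Probe> e : PROBES.entrySet()) {
            in.mark(lookahead);
            try {
                if (e.getValue().accepts(in)) {
                    return e.getKey();
                }
            } catch (IOException notThisFormat) {
                // The decoder rejected the data; fall through to the next probe.
            } finally {
                in.reset(); // rewind so the next probe (or the caller) sees byte 0
            }
        }
        return null;
    }
}
```

The sketch also shows why this route is fragile: the result depends on the order of the probes, on each decoder actually failing on foreign input rather than silently accepting it, and on a lookahead large enough that reset() still works after a decoder has buffered data.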

I'd suggest you stick with Tika, and for any format that Tika doesn't currently support, open a bug report for the detection issue. If you can, upload a very small test file that can be used in a unit test, and if possible the magic detection bytes too. (For a format supported by Commons Compress, you should be able to find the header details in the Commons Compress code if needed.)