
Find out MIME Type of compressed files downloaded from S3 for Java


A client is supposed to upload a compressed file into an S3 folder. Then the compressed file is downloaded and decompressed to perform various operations on its contained files. Originally we told our client to compress its files into a ZIP file, but this proved too difficult for our client. Instead it submitted a RAR file with ZIP extension... how clever. For obvious reasons one can't decompress a RAR file using a ZIP decompressing algorithm.

So, I'm looking for a way to find out the file type of the S3 downloaded files given that I'm working on a Java project with Amazon's SDK on a Linux OS. I'll take care of how to decompress the file depending on the obtained file type.

I've looked at many Stack Overflow questions on this topic, but judging by the answers and their comments, none of the suggested approaches seems 100% effective.

What would be the best approach to find out the compressed file's type?


Solution

  • TL;DR;

    When one uploads a file to Amazon S3 programmatically, one can specify the object's Content-Type. If none is specified, as @Michael-bot clarifies, the default value is binary/octet-stream. If one instead uploads the file through Amazon S3's GUI, the file gets its Content-Type from its file extension (sadly, not its contents). If you can trust whoever uploaded the file to set the Content-Type correctly, go ahead and look at the ObjectMetadata, but if you can't (like me), you need another solution.

    So, if you are looking for a solution that works on the most common file compression types, Files.probeContentType, Apache Tika and SimpleMagic seem to be acceptable solutions.

    In the end I chose Files.probeContentType as it required no extra libraries and works just fine on a Linux machine (as long as the file doesn't have the wrong extension, for which there is a workaround: remove the file extension and let it do its magic).
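
    Whichever detector is used, note that the detectors compared in this answer do not agree on the MIME string for the same format (RAR alone shows up as application/vnd.rar, application/x-rar-compressed and application/x-rar). A minimal sketch of normalizing those strings before choosing a decompressor follows. The ArchiveKinds class and ArchiveKind enum are hypothetical names for illustration, not part of any library, and the table only covers the strings that appear in the results in this answer.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper: maps the MIME strings reported in this answer to one archive kind.
final class ArchiveKinds {

    enum ArchiveKind { ZIP, RAR, SEVEN_ZIP, XZ, GZIP, UNKNOWN }

    private static final Map<String, ArchiveKind> BY_MIME = new HashMap<>();

    static {
        BY_MIME.put("application/zip", ArchiveKind.ZIP);
        BY_MIME.put("application/vnd.rar", ArchiveKind.RAR);
        BY_MIME.put("application/x-rar", ArchiveKind.RAR);
        BY_MIME.put("application/x-rar-compressed", ArchiveKind.RAR);
        BY_MIME.put("application/x-7z-compressed", ArchiveKind.SEVEN_ZIP);
        BY_MIME.put("application/x-xz", ArchiveKind.XZ);
        BY_MIME.put("application/x-xz-compressed-tar", ArchiveKind.XZ);
        BY_MIME.put("application/gzip", ArchiveKind.GZIP);
        BY_MIME.put("application/x-gzip", ArchiveKind.GZIP);
        BY_MIME.put("application/x-compressed-tar", ArchiveKind.GZIP);
    }

    // Returns UNKNOWN for null input or for any MIME string not listed above.
    static ArchiveKind fromMimeType(final String mimeType) {
        if (mimeType == null) {
            return ArchiveKind.UNKNOWN;
        }
        final String key = mimeType.trim().toLowerCase(Locale.ROOT);
        return BY_MIME.getOrDefault(key, ArchiveKind.UNKNOWN);
    }
}
```

    A caller would then switch on the returned ArchiveKind rather than on the raw string, so swapping the detector later does not touch the decompression code.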


    The Test Setup

    At first one would think that the response object when downloading the file from Amazon's S3 includes the file type. And it does contain this information, but the problem arises when the extension of the file doesn't match its contents.

    import com.amazonaws.services.s3.model.S3Object;
    
    final S3Object s3Object = ...;
    final String contentType = s3Object.getObjectMetadata().getContentType();
    

    This code returns application/zip even if the contents are actually those of a RAR file, so this solution doesn't work for me.

    For this reason I took the time to build a sample project that tested various scenarios with the different approaches and libraries available. I'm using Java 8 by the way.

    The file types tested are:

    • A Zip file with Zip extension and without extension
    • A Rar file with Rar extension, Zip extension and without extension
    • A 7z file with 7z extension, Zip extension and without extension
    • A Tar.xz with Tar.xz extension, Zip extension and without extension
    • A Tar.gz with Tar.gz extension, Zip extension and without extension

    Beware, the implementations presented here are only for testing purposes. They are not in any way endorsed to be used in production code, as they don't consider file locking problems among other things that my imagination couldn't bother to consider. =)


    MimetypesFileTypeMap

    Implementation

    import java.io.File;
    import javax.activation.MimetypesFileTypeMap;
    
    final File file = new File(basePath + "/" + fileName);
    try {
        return MimetypesFileTypeMap.getDefaultFileTypeMap().getContentType(file);
    } catch (final Exception exception) {
        return "<EXCEPTION: " + exception.getMessage() + ">";
    }
    

    Results

    Rar with Rar extension is:       application/octet-stream
    Rar with Zip extension is:       application/octet-stream
    Zip with Zip extension is:       application/octet-stream
    7z with 7z extension is:         application/octet-stream
    7z with Zip extension is:        application/octet-stream
    Tar.xz with Tar.xz extension is: application/octet-stream
    Tar.xz with Zip extension is:    application/octet-stream
    Tar.gz with Tar.gz extension is: application/octet-stream
    Tar.gz with Zip extension is:    application/octet-stream
    Rar without extension is:        application/octet-stream
    Zip without extension is:        application/octet-stream
    7z without extension is:         application/octet-stream
    Tar.xz without extension is:     application/octet-stream
    Tar.gz without extension is:     application/octet-stream
    

    Conclusion

    The value returned by this approach when a file type has not been recognized is application/octet-stream. It seems all scenarios failed so we should discard this approach.


    URLConnection.guessContentTypeFromStream

    Implementation

    import java.io.File;
    import java.io.FileInputStream;
    import java.io.InputStream;
    import java.io.BufferedInputStream;
    import java.net.URLConnection;
    
    final File file = new File(basePath + "/" + fileName);
    // try-with-resources closes the stream; BufferedInputStream supplies the
    // mark/reset support that guessContentTypeFromStream requires.
    try (InputStream inputStream = new BufferedInputStream(new FileInputStream(file))) {
        return URLConnection.guessContentTypeFromStream(inputStream);
    } catch (final Exception exception) {
        return "<EXCEPTION: " + exception.getMessage() + ">";
    }
    

    Results

    Rar with Rar extension is:       null
    Rar with Zip extension is:       null
    Zip with Zip extension is:       null
    7z with 7z extension is:         null
    7z with Zip extension is:        null
    Tar.xz with Tar.xz extension is: null
    Tar.xz with Zip extension is:    null
    Tar.gz with Tar.gz extension is: null
    Tar.gz with Zip extension is:    null
    Rar without extension is:        null
    Zip without extension is:        null
    7z without extension is:         null
    Tar.xz without extension is:     null
    Tar.gz without extension is:     null
    

    Conclusion

    Again, this method fails all scenarios. It seems its support is very limited.
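
    To illustrate how narrow that support is: the method only checks the leading bytes against a short built-in list of signatures, of which GIF is one. The sketch below feeds it a GIF header and a ZIP header from memory. StreamGuessDemo is a hypothetical name, and what the second call returns may depend on the JDK version (in the tests above the ZIP files came back as null).

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.net.URLConnection;
import java.nio.charset.StandardCharsets;

final class StreamGuessDemo {

    // Returns whatever the JDK guesses from the leading bytes, or null if it has no guess.
    static String guess(final byte[] leadingBytes) {
        // ByteArrayInputStream supports mark/reset, which guessContentTypeFromStream requires.
        try (InputStream in = new ByteArrayInputStream(leadingBytes)) {
            return URLConnection.guessContentTypeFromStream(in);
        } catch (final IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(final String[] args) {
        // A GIF header is among the signatures the JDK knows.
        System.out.println(guess("GIF89a".getBytes(StandardCharsets.US_ASCII)));
        // The usual ZIP local-file-header signature, "PK" followed by 0x03 0x04.
        System.out.println(guess(new byte[] {'P', 'K', 3, 4}));
    }
}
```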


    Files.probeContentType

    Implementation

    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;
    
    try {
        final Path path = Paths.get(basePath + "/" + fileName);
        return Files.probeContentType(path);
    } catch (final Exception exception) {
        return "<EXCEPTION: " + exception.getMessage() + ">";
    }
    

    Results

    Rar with Rar extension is:       application/vnd.rar
    Rar with Zip extension is:       application/zip
    Zip with Zip extension is:       application/zip
    7z with 7z extension is:         application/x-7z-compressed
    7z with Zip extension is:        application/zip
    Tar.xz with Tar.xz extension is: application/x-xz-compressed-tar
    Tar.xz with Zip extension is:    application/zip
    Tar.gz with Tar.gz extension is: application/x-compressed-tar
    Tar.gz with Zip extension is:    application/zip
    Rar without extension is:        application/vnd.rar
    Zip without extension is:        application/zip
    7z without extension is:         application/x-7z-compressed
    Tar.xz without extension is:     application/x-xz
    Tar.gz without extension is:     application/gzip
    

    Conclusion

    This method worked surprisingly well, but don't be fooled: there is a scenario where it consistently fails. If a file has the wrong extension (one that doesn't match its content), it reports the type implied by the extension instead of the real one. That should not happen very often, but if one needs to be strict this method is not to be used as is.

    Also, some warn that this approach doesn't work well on Windows.

    Workaround: If one manages to remove the extension from the filename, this would return the proper value for all the given scenarios.
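
    A minimal sketch of that workaround, assuming it is acceptable to copy the download to a temporary location first. ExtensionlessProbe and the file name "payload" are hypothetical. The result still depends on which file type detectors the platform and JDK provide, so null remains a possible return value.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

final class ExtensionlessProbe {

    // Copies the file to a temporary, extension-free name and probes that copy.
    static String probeWithoutExtension(final Path original) throws IOException {
        final Path tempDir = Files.createTempDirectory("probe");
        final Path copy = tempDir.resolve("payload"); // deliberately no extension
        try {
            Files.copy(original, copy, StandardCopyOption.REPLACE_EXISTING);
            // May return null when no installed detector recognizes the content.
            return Files.probeContentType(copy);
        } finally {
            // Clean up the copy and its directory; the original is left untouched.
            Files.deleteIfExists(copy);
            Files.deleteIfExists(tempDir);
        }
    }
}
```

    For large archives the copy is wasteful; saving the S3 download under an extension-free name in the first place avoids it.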


    Apache Tika (tika-eval 1.18)

    There seem to be many flavors of this library (app, server, eval, etc), but many around the web complain about it being somewhat "dependency-heavy".

    Implementation

    import java.io.File;
    import org.apache.tika.Tika;
    
    try {
        return new Tika().detect(new File(basePath + "/" + fileName));
    } catch (final Exception exception) {
        return "<EXCEPTION: " + exception.getMessage() + ">";
    }
    

    Results

    Rar with Rar extension is:       application/x-rar-compressed
    Rar with Zip extension is:       application/x-rar-compressed
    Zip with Zip extension is:       application/zip
    7z with 7z extension is:         application/x-7z-compressed
    7z with Zip extension is:        application/x-7z-compressed
    Tar.xz with Tar.xz extension is: application/x-xz
    Tar.xz with Zip extension is:    application/x-xz
    Tar.gz with Tar.gz extension is: application/gzip
    Tar.gz with Zip extension is:    application/gzip
    Rar without extension is:        application/x-rar-compressed
    Zip without extension is:        application/zip
    7z without extension is:         application/x-7z-compressed
    Tar.xz without extension is:     application/x-xz
    Tar.gz without extension is:     application/gzip
    

    Conclusion

    All files were properly identified, but this approach has disadvantages as well as advantages.

    Pros:

    • Maintained by Apache.
    • Does not get fooled by extensions.

    Cons:

    • Really heavy, especially if one only wants to get the file type. The tika-eval JAR weighs over 40 MB.
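
    If the only goal is to tell a handful of archive formats apart, a content-based check can be approximated without any library by comparing the first bytes of the file against the well-known signatures ("magic numbers") of each format. This is a sketch under that assumption, not a description of how Tika works internally. MagicSniffer is a hypothetical name, the MIME strings are the ones Tika reported above, and only the five formats tested here are covered.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

final class MagicSniffer {

    private static final int HEADER_LENGTH = 8;

    // Classifies a file by its leading bytes; the file name is never consulted.
    static String sniff(final byte[] header) {
        // Regular ZIP archives; empty or spanned archives use other "PK" variants not handled here.
        if (startsWith(header, 0x50, 0x4B, 0x03, 0x04)) return "application/zip";
        // "Rar!" 0x1A 0x07 is shared by the RAR 4 and RAR 5 signatures.
        if (startsWith(header, 0x52, 0x61, 0x72, 0x21, 0x1A, 0x07)) return "application/x-rar-compressed";
        if (startsWith(header, 0x37, 0x7A, 0xBC, 0xAF, 0x27, 0x1C)) return "application/x-7z-compressed";
        if (startsWith(header, 0xFD, 0x37, 0x7A, 0x58, 0x5A, 0x00)) return "application/x-xz";
        if (startsWith(header, 0x1F, 0x8B)) return "application/gzip";
        return "application/octet-stream";
    }

    // Reads at most HEADER_LENGTH bytes from the file and classifies them.
    static String sniff(final Path path) throws IOException {
        final byte[] buffer = new byte[HEADER_LENGTH];
        int total = 0;
        try (InputStream in = Files.newInputStream(path)) {
            int read;
            while (total < buffer.length
                    && (read = in.read(buffer, total, buffer.length - total)) != -1) {
                total += read;
            }
        }
        return sniff(Arrays.copyOf(buffer, total));
    }

    private static boolean startsWith(final byte[] data, final int... signature) {
        if (data.length < signature.length) {
            return false;
        }
        for (int i = 0; i < signature.length; i++) {
            if ((data[i] & 0xFF) != signature[i]) {
                return false;
            }
        }
        return true;
    }
}
```

    As with Tika's results above, a Tar.xz or Tar.gz is reported by its outer compression layer (xz or gzip); finding the tar inside would require decompressing the stream first.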

    URLConnection

    Implementation

    import java.net.URL;
    import java.net.URLConnection;
    
    try {
        final URL url = new URL("file://" + basePath + "/" + fileName);
        final URLConnection urlConnection = url.openConnection();
        return urlConnection.getContentType();
    } catch (final Exception exception) {
        return "<EXCEPTION: " + exception.getMessage() + ">";
    }
    

    Results

    Rar with Rar extension is:       content/unknown
    Rar with Zip extension is:       application/zip
    Zip with Zip extension is:       application/zip
    7z with 7z extension is:         content/unknown
    7z with Zip extension is:        application/zip
    Tar.xz with Tar.xz extension is: content/unknown
    Tar.xz with Zip extension is:    application/zip
    Tar.gz with Tar.gz extension is: application/octet-stream
    Tar.gz with Zip extension is:    application/zip
    Rar without extension is:        content/unknown
    Zip without extension is:        content/unknown
    7z without extension is:         content/unknown
    Tar.xz without extension is:     content/unknown
    Tar.gz without extension is:     content/unknown
    

    Conclusion

    It recognizes hardly any compression format, and it goes by the file's extension rather than its contents.
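
    The extension-driven behavior can be reproduced in isolation with the related URLConnection.guessContentTypeFromName, which consults the JDK's built-in file name map and never opens the file. That the file: URL handler goes through the same map is an assumption here, although the results above are consistent with it. NameGuessDemo and the file names are hypothetical.

```java
import java.net.URLConnection;

final class NameGuessDemo {

    // Looks only at the name; the file does not even need to exist.
    static String guessFromName(final String fileName) {
        return URLConnection.guessContentTypeFromName(fileName);
    }

    public static void main(final String[] args) {
        // A RAR archive that was renamed to .zip is still judged by its name alone.
        System.out.println(guessFromName("really-a-rar.zip"));
        System.out.println(guessFromName("archive.rar"));
    }
}
```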


    SimpleMagic 1.14

    This project seems to be updated at least once a year.

    Implementation

    import com.j256.simplemagic.ContentInfo;
    import com.j256.simplemagic.ContentInfoUtil;
    
    try {
        final ContentInfoUtil util = new ContentInfoUtil();
        final ContentInfo info = util.findMatch(basePath + "/" + fileName);
    
        return info.getMimeType();
    } catch (final Exception exception) {
        return "<EXCEPTION: " + exception.getMessage() + ">";
    }
    

    Results

    Rar with Rar extension is:       application/x-rar
    Rar with Zip extension is:       application/x-rar
    Zip with Zip extension is:       application/zip
    7z with 7z extension is:         application/x-7z-compressed
    7z with Zip extension is:        application/x-7z-compressed
    Tar.xz with Tar.xz extension is: <EXCEPTION: null>
    Tar.xz with Zip extension is:    <EXCEPTION: null>
    Tar.gz with Tar.gz extension is: application/x-gzip
    Tar.gz with Zip extension is:    application/x-gzip
    Rar without extension is:        application/x-rar
    Zip without extension is:        application/zip
    7z without extension is:         application/x-7z-compressed
    Tar.xz without extension is:     <EXCEPTION: null>
    Tar.gz without extension is:     application/x-gzip
    

    Conclusion

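    It worked for almost all our scenarios, but it failed on the more "obscure" compression formats like Tar.xz. The <EXCEPTION: null> entries are most likely a NullPointerException: findMatch appears to return null when nothing matches, and the test code calls getMimeType() on the result without a null check.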


    MimeUtil 2.1.3

    This project has not been modified since 2010, so don't expect support or updates. It is listed here only for the sake of completeness.

    Implementation

    import eu.medsea.mimeutil.MimeUtil2;
    
    try {
        final MimeUtil2 mimeUtil = new MimeUtil2();
        mimeUtil.registerMimeDetector("eu.medsea.mimeutil.detector.MagicMimeMimeDetector");
    
        return MimeUtil2.getMostSpecificMimeType(mimeUtil.getMimeTypes(basePath + "/" + fileName)).toString();
    } catch (final Exception exception) {
        return "<EXCEPTION: " + exception.getMessage() + ">";
    }
    

    Results

    Rar with Rar extension is:       application/x-rar
    Rar with Zip extension is:       application/x-rar
    Zip with Zip extension is:       application/zip
    7z with 7z extension is:         application/octet-stream
    7z with Zip extension is:        application/octet-stream
    Tar.xz with Tar.xz extension is: application/octet-stream
    Tar.xz with Zip extension is:    application/octet-stream
    Tar.gz with Tar.gz extension is: application/x-gzip
    Tar.gz with Zip extension is:    application/x-gzip
    Rar without extension is:        application/x-rar
    Zip without extension is:        application/zip
    7z without extension is:         application/octet-stream
    Tar.xz without extension is:     application/octet-stream
    Tar.gz without extension is:     application/x-gzip
    

    Conclusion

    It identifies some of the most popular file types, but fails with Tar.xz and 7z.


    file - Command Line

    Not the prettiest solution, but it had to be tried: the file command as shipped with Ubuntu.

    Implementation

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    
    try {
        final Process process = Runtime.getRuntime().exec("file --mime-type " + basePath + "/" + fileName);
    
        final BufferedReader stdInput = new BufferedReader(new InputStreamReader(process.getInputStream()));
    
        String text = "";
    
        String s;
        while ((s = stdInput.readLine()) != null) {
            text += s;
        }
    
        return text.split(": ")[1];
    } catch (final Exception exception) {
        return "<EXCEPTION: " + exception.getMessage() + ">";
    }
    

    Results

    Rar with Rar extension is:       application/x-rar
    Rar with Zip extension is:       application/x-rar
    Zip with Zip extension is:       application/zip
    7z with 7z extension is:         application/x-7z-compressed
    7z with Zip extension is:        application/x-7z-compressed
    Tar.xz with Tar.xz extension is: application/x-xz
    Tar.xz with Zip extension is:    application/x-xz
    Tar.gz with Tar.gz extension is: application/gzip
    Tar.gz with Zip extension is:    application/gzip
    Rar without extension is:        application/x-rar
    Zip without extension is:        application/zip
    7z without extension is:         application/x-7z-compressed
    Tar.xz without extension is:     application/x-xz
    Tar.gz without extension is:     application/gzip
    

    Conclusion

    It works for all our scenarios, but it relies on the file command being present on the system running the code.
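
    One caveat with the test implementation above: it builds the command as a single string, so a path containing spaces would be split into several arguments. A slightly more defensive sketch, assuming a file binary on the PATH that supports the --brief and --mime-type options, passes each argument separately and returns null when the command is missing. FileCommandProbe is a hypothetical name.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

final class FileCommandProbe {

    // Returns the MIME type printed by `file`, or null if the command is unavailable or fails.
    static String mimeTypeOf(final Path path) {
        if (!Files.isRegularFile(path)) {
            return null;
        }
        // Separate arguments avoid shell-style splitting of paths that contain spaces;
        // --brief omits the "name: " prefix, so no string splitting is needed afterwards.
        final ProcessBuilder builder = new ProcessBuilder(
                "file", "--brief", "--mime-type", path.toAbsolutePath().toString());
        builder.redirectErrorStream(true);
        try {
            final Process process = builder.start();
            final String firstLine;
            try (BufferedReader reader = new BufferedReader(
                    new InputStreamReader(process.getInputStream(), StandardCharsets.UTF_8))) {
                firstLine = reader.readLine();
            }
            final int exitCode = process.waitFor();
            if (exitCode != 0 || firstLine == null || firstLine.trim().isEmpty()) {
                return null;
            }
            return firstLine.trim();
        } catch (final IOException e) {
            return null; // for example, the `file` binary is not installed
        } catch (final InterruptedException e) {
            Thread.currentThread().interrupt();
            return null;
        }
    }
}
```

    Returning null instead of throwing lets the caller fall back to another detector on systems where the command is not installed.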