Tags: java, mime-types, file-type

How to accurately determine mime data from a file?


I'm adding some functionality to a program so that I can accurately determine a file's type by reading its MIME data. I've already tried a few methods:

Method 1:

import javax.activation.FileDataSource;

// "~" is not expanded by Java; use the file's actual path here.
FileDataSource ds = new FileDataSource("~\\Downloads\\777135_new.xls");
String contentType = ds.getContentType();
System.out.println("The MIME type of the file is: " + contentType);

//output = The MIME type of the file is: application/octet-stream

Method 2:

import java.io.RandomAccessFile;
import net.sf.jmimemagic.*;

// try-with-resources closes the file even if detection fails
try (RandomAccessFile f = new RandomAccessFile("~\\Downloads\\777135_new.xls", "r"))
{
    byte[] fileBytes = new byte[(int) f.length()];
    f.readFully(fileBytes); // read() alone may return before the buffer is full
    MagicMatch match = Magic.getMagicMatch(fileBytes);
    System.out.println("The Mime type is: " + match.getMimeType());
}
catch (Exception e)
{
    System.out.println(e);
}

//output = The Mime type is: application/msword

Method 3:

import java.io.File;
import java.util.Collection;
import eu.medsea.mimeutil.*;

MimeUtil.registerMimeDetector("eu.medsea.mimeutil.detector.MagicMimeMimeDetector");
File f = new File("~\\Downloads\\777135_new.xls");
Collection<?> mimeTypes = MimeUtil.getMimeTypes(f);
String mimeType = MimeUtil.getFirstMimeType(mimeTypes.toString()).toString();
String subMimeType = MimeUtil.getSubType(mimeTypes.toString());
System.out.println("The Mime type is: " + mimeTypes + ", " + mimeType + ", " + subMimeType);

//output = The Mime type is: application/msword, application/msword, msword

I found these three methods at http://www.rgagnon.com/javadetails/java-0487.html. My problem is that the file I am testing them on is one I created myself, so I know it's an Excel file, yet Methods 2 and 3 both incorrectly report the type as msword. Method 1 reports application/octet-stream instead, which I believe is because the built-in FileTypeMap it uses only knows a limited number of file types.

I've had a look around, and some people say the content type is picked up incorrectly because of the way the offset is detected in the files, as pointed out in this wiki on detecting file types in PHP. Unfortunately, the wiki then falls back on the file extension to determine the type, which isn't what I want to do because it's unreliable.

Can anyone point me in the right direction to a method that will detect the file types correctly within Java please?

Cheers, Alexei Blue.

Edit: It looks like there is no specific solution to this, as @IronMensan said in the comment below. I did find a really interesting research paper that applies machine learning in a few ways to help with the issue, but there doesn't seem to be a foolproof answer. I think my best bet will be to pass the file to an Excel file reader and catch any incorrect-format exceptions.


Solution

  • As mentioned in the comments, there are so many possible file types that detection could be hit and miss across ALL possible files, but you probably know which types of files you will typically be dealing with. This excellent list of magic numbers helped me recently with detecting the specific Office formats you mentioned (search for Microsoft Office). You'll see that the MS Office file types have a sub-type signature further into the file, which lets you work out specifically which type of file you have. Many newer formats, such as ODT and the OOXML family (DOCX and so on), use a ZIP file to hold their data, so you might need to detect ZIP first and then look for specifics.
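
    A rough sketch of that two-stage check, in case it helps. The class name OfficeSniffer, the Kind enum and the method names are made up for illustration. The sub-header bytes and the 512-byte offset are the values I remember from the magic-number list for Word and Excel, so treat them as assumptions and check them against the list. The offset check is a heuristic and will miss some files, and other OLE2 formats such as PowerPoint are not covered. The ZIP branch assumes Java 9+ (for readAllBytes) and relies on OOXML packages keeping their parts under xl/, word/ or ppt/, and on ODF packages carrying a "mimetype" entry.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

// Hypothetical helper: sniffs a few office formats from a file's raw bytes.
final class OfficeSniffer {
    enum Kind { XLS, DOC, OLE2_OTHER, XLSX, DOCX, PPTX, ODF, ZIP_OTHER, UNKNOWN }

    // OLE2 compound-file signature shared by the legacy .doc/.xls/.ppt formats.
    private static final byte[] OLE2 = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1 };
    // ZIP local-file-header signature ("PK" 03 04).
    private static final byte[] ZIP = { 'P', 'K', 3, 4 };
    // Assumed sub-header values, expected at byte offset 512 (see note above).
    private static final byte[] WORD_SUB = { (byte) 0xEC, (byte) 0xA5, (byte) 0xC1, 0x00 };
    private static final byte[] EXCEL_SUB = { 0x09, 0x08, 0x10, 0x00, 0x00, 0x06, 0x05, 0x00 };
    private static final int SUB_OFFSET = 512;

    // True if sig occurs in data starting exactly at offset.
    static boolean hasBytesAt(byte[] data, int offset, byte[] sig) {
        if (data.length < offset + sig.length) {
            return false;
        }
        for (int i = 0; i < sig.length; i++) {
            if (data[offset + i] != sig[i]) {
                return false;
            }
        }
        return true;
    }

    // Stage 1: container signature. Stage 2: look for specifics inside it.
    static Kind sniff(byte[] data) {
        if (hasBytesAt(data, 0, OLE2)) {
            if (hasBytesAt(data, SUB_OFFSET, EXCEL_SUB)) return Kind.XLS;
            if (hasBytesAt(data, SUB_OFFSET, WORD_SUB)) return Kind.DOC;
            return Kind.OLE2_OTHER; // OLE2 container, sub-type not recognised
        }
        if (hasBytesAt(data, 0, ZIP)) {
            return sniffZip(data);
        }
        return Kind.UNKNOWN;
    }

    // Walks the ZIP entries and classifies by well-known entry names.
    private static Kind sniffZip(byte[] data) {
        try (ZipInputStream zip = new ZipInputStream(new ByteArrayInputStream(data))) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                String name = entry.getName();
                if (name.startsWith("xl/")) return Kind.XLSX;
                if (name.startsWith("word/")) return Kind.DOCX;
                if (name.startsWith("ppt/")) return Kind.PPTX;
                if (name.equals("mimetype")) {
                    // ODF stores its MIME type as plain text in this entry.
                    String mime = new String(zip.readAllBytes(), StandardCharsets.US_ASCII);
                    if (mime.trim().startsWith("application/vnd.oasis.opendocument")) {
                        return Kind.ODF;
                    }
                }
            }
        } catch (IOException e) {
            // Corrupt or truncated archive: fall through to the generic answer.
        }
        return Kind.ZIP_OTHER;
    }
}
```

    You would read the file into a byte array (for example with java.nio.file.Files.readAllBytes) and pass it to OfficeSniffer.sniff. For big files it would be better to read only the first few hundred bytes for the OLE2 check and open the file as a ZIP separately, but the single-array version keeps the sketch short.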