
Does file type rely on file extension?


As a general question: what role does the file extension play in determining a file's type?

For example, I can rename a .jpeg file to .png, or even to .txt. Of course, once it is renamed to .txt it will no longer be opened as a picture, and it is not readable as text either.

To determine the file type, the safe way seems to be to parse the first few bytes of the file. If the extension cannot be trusted, then it is nothing more than part of the file name.
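
To make the "first few bytes" idea concrete, here is a minimal sketch in Python. The signature table is deliberately tiny, and the names `SIGNATURES`, `sniff_type` and `sniff_file` are made up for illustration; real tools such as `file` use a much larger database.

```python
from typing import Optional

# Leading byte signatures ("magic numbers") for a few common formats.
# This table is illustrative and far from complete.
SIGNATURES = {
    "jpeg": b"\xff\xd8\xff",
    "png": b"\x89PNG\r\n\x1a\n",
    "gif": b"GIF8",
    "bmp": b"BM",
}


def sniff_type(data: bytes) -> Optional[str]:
    """Guess a format from the leading bytes, ignoring any file name."""
    for name, magic in SIGNATURES.items():
        if data.startswith(magic):
            return name
    return None


def sniff_file(path) -> Optional[str]:
    """Read only the first few bytes of a file and guess its format."""
    with open(path, "rb") as handle:
        return sniff_type(handle.read(16))
```

A PNG renamed to `picture.txt` would still be reported as `"png"` by `sniff_file`, because the name is never consulted. Note that this only tells you what the file claims to be, not that the rest of the file is valid.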


Solution

  • As a general rule, you should ALWAYS parse the COMPLETE file in order to be sure that the file is what the extension says. As you can easily imagine, it is pretty simple to create a binary file that resembles, say, a BMP (with a correct header) but actually contains something different.

    You should never trust either the extension or the header, because otherwise a malicious user could exploit your code to trigger, for example, a buffer overflow. This matters most if you are writing programs that run with root/admin privileges.
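
As a rough illustration of checking the whole file rather than just the header, here is a sketch that walks every chunk of a PNG and verifies its length and CRC. I am using PNG instead of BMP here only because its chunk checksums make the idea easy to show; `is_well_formed_png` is a name I made up. It checks structure only and does not decode the pixel data, so a real validator would still need to go further.

```python
import struct
import zlib

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def is_well_formed_png(data: bytes) -> bool:
    """Structural check: signature, chunk bounds, CRCs, IHDR first, IEND last."""
    if not data.startswith(PNG_SIGNATURE):
        return False
    pos = len(PNG_SIGNATURE)
    chunk_types = []
    while pos < len(data):
        # Each chunk is: 4-byte big-endian length, 4-byte type, data, 4-byte CRC.
        if pos + 8 > len(data):
            return False
        length, chunk_type = struct.unpack(">I4s", data[pos:pos + 8])
        end = pos + 8 + length + 4
        if end > len(data):
            return False  # declared length runs past the end of the file
        body = data[pos + 8:pos + 8 + length]
        (stored_crc,) = struct.unpack(">I", data[end - 4:end])
        # The CRC covers the chunk type and the chunk data.
        if zlib.crc32(chunk_type + body) != stored_crc:
            return False
        chunk_types.append(chunk_type)
        pos = end
        if chunk_type == b"IEND":
            break
    return (
        len(chunk_types) >= 2
        and chunk_types[0] == b"IHDR"
        and chunk_types[-1] == b"IEND"
        and pos == len(data)  # reject trailing bytes after IEND
    )
```

A file with a perfect PNG signature followed by arbitrary bytes passes a header-only check but fails this one, which is exactly the case described above.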

    Having said the obvious, the file extension nowadays is mainly used so that the OS can associate a program with that particular file (usually launching the program and passing the selected file as the first parameter); it is then up to the program to determine the file's content.

    It is a little different for executable files. Under Unix, a file must have the "x" flag set in order to be executable; otherwise it will not run, regardless of the extension. Under Windows, there is no such thing, and the OS relies on only a few extensions (EXE, COM, BAT, etc.) to determine which files can be executed.
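
On a Unix-like system, the "x" flag can be inspected directly from the file's mode bits. This is a small sketch; `has_exec_bit` is a hypothetical helper name, and the result is only meaningful on POSIX file systems.

```python
import os
import stat


def has_exec_bit(path) -> bool:
    """Return True if any of the owner/group/other execute bits is set."""
    mode = os.stat(path).st_mode
    return bool(mode & (stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH))
```

A file named `tool.exe` with mode 644 gives `False`, while a file with no extension at all and mode 755 gives `True`. If you want to know whether the current user specifically may execute the file, `os.access(path, os.X_OK)` is the closer match.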

    An EXE file, for example, has to start with "MZ", followed by header fields describing its size and memory allocation (http://www.delorie.com/djgpp/doc/exe/), and the OS checks these internal headers when loading it. Other formats (e.g. the COM executable format of the MS-DOS era) are just raw machine code with no header, so there is nothing for the OS to check. It simply loads the bytes and lets the CPU execute them, hoping that everything will be fine.
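
For the curious, here is a sketch of reading the start of an MZ header, based on the layout in the delorie.com document linked above: the "MZ" signature, then the number of bytes used in the last 512-byte page, then the number of pages. The function name and the returned dictionary keys are my own invention, and this looks only at the first three fields.

```python
import struct
from typing import Optional


def parse_mz_header(data: bytes) -> Optional[dict]:
    """Read the signature and size fields at the start of a DOS MZ header."""
    if len(data) < 6 or data[:2] != b"MZ":
        return None
    # Two little-endian 16-bit fields follow the signature.
    bytes_in_last_page, pages = struct.unpack_from("<HH", data, 2)
    if pages == 0:
        return None  # treated as malformed in this sketch
    size = pages * 512
    if bytes_in_last_page:
        # The last page is only partially used.
        size -= 512 - bytes_in_last_page
    return {
        "bytes_in_last_page": bytes_in_last_page,
        "pages": pages,
        "image_size": size,
    }
```

A COM file has no equivalent: there is no signature or size field to read, which is why nothing can be verified before it runs.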

    So, to summarize:

    • The file extension is mainly used so that the OS can call the appropriate program to open the file, passing the filename as the first parameter (argc/argv in C, for example)

    • Windows relies on certain file extensions to know whether a file is executable, while Unix/Mac relies on a particular flag (x) associated with the file

    Two things that are not well known about file extensions: directory names can have extensions too, and an extension can be much longer than the usual 3 characters.
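
Both points are easy to see with Python's `pathlib`. This is just a quick sketch; the names `holiday.jpeg` and `readme.markdown` are arbitrary, and `describe` is a throwaway helper.

```python
import tempfile
from pathlib import Path


def describe(path: Path) -> tuple:
    """Return what the path really is, alongside its extension."""
    kind = "directory" if path.is_dir() else "file"
    return (kind, path.suffix)


with tempfile.TemporaryDirectory() as root:
    folder = Path(root) / "holiday.jpeg"     # a directory with an "image" extension
    folder.mkdir()
    note = Path(root) / "readme.markdown"    # an extension longer than 3 characters
    note.write_text("hello")

    folder_info = describe(folder)
    note_info = describe(note)
```

Here `folder_info` reports a directory whose suffix is `.jpeg`, which is one more reason not to infer anything about content from the name alone.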