python header-files detection file-type magic-numbers

How to check the type of a file using its header signature (magic numbers)?


When I open a file whose extension I already know, my code successfully detects the file type from the "magic number".

magic_numbers = {'png': bytes([0x89, 0x50, 0x4E, 0x47, 0x0D, 0x0A, 0x1A, 0x0A]),
                 'jpg': bytes([0xFF, 0xD8, 0xFF]),  # fourth byte varies (E0 for JFIF, E1 for Exif, ...)
                 #*********************#
                 'doc': bytes([0xD0, 0xCF, 0x11, 0xE0, 0xA1, 0xB1, 0x1A, 0xE1]),
                 'xls': bytes([0xD0, 0xCF, 0x11, 0xE0, 0xA1, 0xB1, 0x1A, 0xE1]),
                 'ppt': bytes([0xD0, 0xCF, 0x11, 0xE0, 0xA1, 0xB1, 0x1A, 0xE1]),
                 #*********************#
                 'docx': bytes([0x50, 0x4B, 0x03, 0x04, 0x14, 0x00, 0x06, 0x00]),
                 'xlsx': bytes([0x50, 0x4B, 0x03, 0x04, 0x14, 0x00, 0x06, 0x00]),
                 'pptx': bytes([0x50, 0x4B, 0x03, 0x04, 0x14, 0x00, 0x06, 0x00]),
                 #*********************#
                 'pdf': bytes([0x25, 0x50, 0x44, 0x46]),
                 #*********************#
                 'dll': bytes([0x4D, 0x5A, 0x90, 0x00]),
                 'exe': bytes([0x4D, 0x5A]),

                 }

max_read_size = max(len(m) for m in magic_numbers.values()) 
 
with open('file.pdf', 'rb') as fd:
    file_head = fd.read(max_read_size)
 
if file_head.startswith(magic_numbers['pdf']):
    print("It's a PDF File")
else:
    print("It's not a PDF file")

I want to know how to modify it so that I do not have to hard-code the following check for one specific type, i.e. once I supply a file, the code reports its type directly.

if file_head.startswith(magic_numbers['pdf']):
    print("It's a PDF File")
else:
    print("It's not a PDF file")

I hope you understand me.


Solution

  • You most likely just want to iterate over the dictionary and test every signature.

    You may be able to optimize or add some error checking by using the extension as well. If you take the extension from the filename and check its signature first, you'll be successful most of the time. If that check fails, you may not want to accept "baby.png" as an xlsx file: that would be suspicious and worthy of an error.

    But if you ignore the extension, just loop over the entries:

    for ext in magic_numbers:
        if file_head.startswith(magic_numbers[ext]):
            print("It's a {} File".format(ext))
    

    You probably want to put this in a function that returns the type, so you could just return the type instead of printing it out.
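
A minimal sketch of such a function, assuming a reduced copy of the signature table from the question. The names `SIGNATURES` and `matching_types` are hypothetical and not part of the original code. It returns every label that matches instead of printing, which also makes it visible when several types share one signature.

```python
# Hypothetical, reduced copy of the signature table from the question.
SIGNATURES = {
    'png': bytes([0x89, 0x50, 0x4E, 0x47, 0x0D, 0x0A, 0x1A, 0x0A]),
    'pdf': bytes([0x25, 0x50, 0x44, 0x46]),
    'doc': bytes([0xD0, 0xCF, 0x11, 0xE0, 0xA1, 0xB1, 0x1A, 0xE1]),
    'xls': bytes([0xD0, 0xCF, 0x11, 0xE0, 0xA1, 0xB1, 0x1A, 0xE1]),
}


def matching_types(file_head, signatures=SIGNATURES):
    """Return every label whose signature is a prefix of file_head."""
    return [ext for ext, magic in signatures.items()
            if file_head.startswith(magic)]


print(matching_types(b'%PDF-1.7\n'))          # a PDF header matches one entry
print(matching_types(SIGNATURES['doc']))      # doc and xls share a signature
print(matching_types(b'plain text'))          # nothing matches
```

An empty list means no known signature matched; a list with more than one entry means the header alone cannot tell the types apart.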

    EDIT: Since some formats share magic numbers (doc, xls, and ppt, for example), we need to assume the extension is correct until we know that it isn't. I would extract the extension from the filename. This could be done with pathlib or just a string split:

    ext = filename.rsplit('.', 1)[-1]
    

    Then test that extension specifically:

    if ext in magic_numbers:
        if file_head.startswith(magic_numbers[ext]):
            return ext
    

    Put the extension test first, so putting it all together:

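    def detect_type(filename, file_head):
        # Trust the extension first, because several formats share a signature.
        ext = filename.rsplit('.', 1)[-1].lower()
        if ext in magic_numbers:
            if file_head.startswith(magic_numbers[ext]):
                return ext

        # Otherwise fall back to the first signature that matches.
        for ext in magic_numbers:
            if file_head.startswith(magic_numbers[ext]):
                return ext

        # No known signature matched.
        return None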
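
The approach above can be sketched end to end, including reading the header from disk. This is only an illustration under stated assumptions: `SIGNATURES` is a reduced, hypothetical copy of the question's table, `guess_file_type` is a made-up name, and the extension is taken with `pathlib.Path.suffix` rather than a string split.

```python
from pathlib import Path

# Hypothetical, reduced copy of the signature table from the question.
OLE_HEADER = bytes([0xD0, 0xCF, 0x11, 0xE0, 0xA1, 0xB1, 0x1A, 0xE1])
SIGNATURES = {
    'png': bytes([0x89, 0x50, 0x4E, 0x47, 0x0D, 0x0A, 0x1A, 0x0A]),
    'pdf': bytes([0x25, 0x50, 0x44, 0x46]),
    'doc': OLE_HEADER,
    'xls': OLE_HEADER,
}


def guess_file_type(path, signatures=SIGNATURES):
    """Return the matching label for the file at path, or None."""
    path = Path(path)

    # Read only as many bytes as the longest signature needs.
    max_read_size = max(len(m) for m in signatures.values())
    with path.open('rb') as fd:
        file_head = fd.read(max_read_size)

    # Path.suffix includes the leading dot (".pdf"), so strip it.
    ext = path.suffix.lstrip('.').lower()

    # Trust the extension first when its signature agrees with the header.
    if ext in signatures and file_head.startswith(signatures[ext]):
        return ext

    # Otherwise fall back to the first signature that matches.
    for candidate, magic in signatures.items():
        if file_head.startswith(magic):
            return candidate

    return None
```

With this sketch, a file named `report.xls` that starts with the shared OLE header is reported as `xls` rather than `doc`, because the extension is checked first. A caller that wants to treat a mismatch such as a PDF named `baby.png` as an error can compare the returned label with the extension.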