
doc or docx: Is there a safe way to identify the type from 'requests' in Python 3?


1) How can I differentiate doc and docx files from requests?

a) For instance, if I have

url='https://www.iadb.org/Document.cfm?id=36943997'
r = requests.get(url,timeout=15)
print(r.headers['content-type'])

I get this:

application/vnd.openxmlformats-officedocument.wordprocessingml.document

This file is a docx.

b) If I have

url='https://www.iadb.org/Document.cfm?id=36943972'
r = requests.get(url,timeout=15)
print(r.headers['content-type'])

I get this:

application/msword

This file is a doc.

2) Are there other options?

3) If I save a docx file with a .doc extension, or vice versa, could that cause recognition problems (for instance, when converting to PDF)? Is there a best practice for dealing with this?


Solution

  • The MIME types you get appear to be the correct ones; see "What is a correct mime type for docx, pptx etc?"

    However, the sending software can only go by the file its user selected, and plenty of people still send files with the wrong extension. Some software can handle this; other software cannot. To see this in action, rename a PNG image so that its name ends in .jpeg instead. I just did this on my Mac and Preview is still able to open it. When I press ⌘+I in the Finder, it says it is a JPEG file, but when opened in Preview it is correctly identified as a "Portable Network Graphics" file. (Your OS may or may not be able to do this.)

    But once the file is downloaded, you can unambiguously distinguish between a DOC and a DOCX file, even if the author got its extension wrong.

    A DOC file starts with a Microsoft OLE header, which is quite a complicated structure. A DOCX file, on the other hand, is a container format holding lots of smaller XML files, compressed together using standard ZIP compression. Therefore, this file type will always start with the two characters PK.

    This check is compatible with Python 2.7 and 3.x (only Python 3 needs the decode):

    import sys
    
    if len(sys.argv) == 2:
        print('testing file: ' + sys.argv[1])
        # Open in binary mode so the raw bytes are read unchanged.
        with open(sys.argv[1], 'rb') as testMe:
            # Decode so the comparison with 'PK' also works on Python 3.
            startBytes = testMe.read(2).decode('latin1')
            print(startBytes)
            if startBytes == 'PK':
                print('This is a DOCX document')
            else:
                print('This is a DOC document')
    

    Technically it will confidently state "This is a DOC document" for anything that does not start with PK, and, conversely, it will say "This is a DOCX document" for any zipped file (or even a plain text file that happens to start with those two characters). So if you further process the file based on this decision, you may find out it's not a Microsoft Word document after all. But at least you will have tried with the proper decoder.
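
If you want to narrow down the false positives described above, a somewhat stricter check is possible. The following is only a sketch for Python 3, not a complete validator. It assumes that a DOC file begins with the 8-byte OLE2 signature and that a DOCX package keeps its main part at word/document.xml, which is what Word normally writes but is not guaranteed. The names kind_from_content_type, sniff_word_format and MIME_TO_KIND are made up for this example.

```python
import io
import zipfile

# First 8 bytes of an OLE2 compound file (old .doc, but also old .xls, .ppt).
OLE_MAGIC = b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1"
# Local file header signature at the start of a non-empty ZIP archive.
ZIP_MAGIC = b"PK\x03\x04"

# The two MIME types from the question (hypothetical lookup table).
MIME_TO_KIND = {
    "application/msword": "doc",
    "application/vnd.openxmlformats-officedocument.wordprocessingml.document": "docx",
}


def kind_from_content_type(header_value):
    """Map a Content-Type header value to 'doc', 'docx' or 'unknown'."""
    # Drop parameters such as '; charset=...' and normalise the case.
    mime = (header_value or "").split(";", 1)[0].strip().lower()
    return MIME_TO_KIND.get(mime, "unknown")


def sniff_word_format(data):
    """Guess 'doc', 'docx' or 'unknown' from the raw bytes of a download."""
    if data.startswith(OLE_MAGIC):
        # Any OLE2 container starts like this, so 'doc' really means
        # "OLE2 file, possibly a Word document".
        return "doc"
    if data.startswith(ZIP_MAGIC):
        try:
            with zipfile.ZipFile(io.BytesIO(data)) as archive:
                # Assumption: the main document part lives at this path.
                if "word/document.xml" in archive.namelist():
                    return "docx"
        except zipfile.BadZipFile:
            # Starts with the ZIP signature but is not a readable archive.
            pass
    return "unknown"
```

With requests you would call sniff_word_format(r.content) and compare the result with kind_from_content_type(r.headers.get('content-type')). When the two disagree, the bytes are probably the better guide, since the header only reflects what the server was told about the file.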