
Python | HTTP - can't get the correct MIME type


I am building a web crawler using urllib3. Example code:

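from urllib3 import PoolManager

# The problematic link given further down in this question
url = "http://lsa.mcgill.ca/pubdocs/files/advancedcommonlawobligations/523-gold_advancedcommonlawobligations_-2013.docx"

pool = PoolManager()
response = pool.request("GET", url)
# Header lookup is case-insensitive; returns None if the header is missing
mime_type = response.headers.get("content-type")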

I have stumbled upon a few links to document files such as docx and epub, and the MIME type I'm getting from the server is text/plain. It is important to me to get the correct MIME type.

Example of a problematic URL:

http://lsa.mcgill.ca/pubdocs/files/advancedcommonlawobligations/523-gold_advancedcommonlawobligations_-2013.docx

Right now, my logic for determining a file's MIME type is to take it from the server's response and, if that is not available, to fall back on the file's extension.

How come Firefox is not confused by these kinds of URLs and lets the user download the file right away? How does it know that this file is not plain text? How can I get the correct MIME type?


Solution

  • I haven't read the Firefox source code, but I would guess that Firefox does one of the following: it guesses the file type from the URL, it refuses to render the response inline if it has a specific Content-Type and is larger than some maximum size, or it inspects the start of the file contents for a magic number that identifies the format.

    You can use the Python mimetypes module in the standard library to guess what the filetype is based on the URL:

    import mimetypes

    url = "http://lsa.mcgill.ca/pubdocs/files/advancedcommonlawobligations/523-gold_advancedcommonlawobligations_-2013.docx"
    # guess_type returns a (type, encoding) tuple; the name mime_type avoids shadowing the built-in type()
    mime_type, encoding = mimetypes.guess_type(url)

    In this case, mime_type is "application/vnd.openxmlformats-officedocument.wordprocessingml.document", which is probably what you want.
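
To tie this back to the crawler, here is a minimal sketch of one way to combine the server's header with the URL-based guess. The names resolve_mime_type, looks_like_zip and GENERIC_TYPES are hypothetical, and treating text/plain and application/octet-stream as "the server doesn't really know" is an assumption of this sketch, not something Firefox is confirmed to do. The explicit add_type call is a safeguard in case the local MIME tables do not include .docx.

```python
import mimetypes
from urllib.parse import urlparse

# Assumption: these are catch-all types a misconfigured server is likely
# to send, so this sketch treats them as "unknown".
GENERIC_TYPES = {"text/plain", "application/octet-stream"}

# Safeguard: register .docx explicitly in case the local tables lack it.
DOCX_TYPE = "application/vnd.openxmlformats-officedocument.wordprocessingml.document"
mimetypes.add_type(DOCX_TYPE, ".docx")


def resolve_mime_type(url, content_type_header):
    """Hypothetical helper: trust a specific header, else guess from the URL."""
    header_type = None
    if content_type_header:
        # Drop parameters such as "; charset=UTF-8"
        header_type = content_type_header.split(";", 1)[0].strip().lower()
    if header_type and header_type not in GENERIC_TYPES:
        return header_type
    # Guess from the path only, so query strings don't hide the extension
    guessed, _encoding = mimetypes.guess_type(urlparse(url).path)
    # Fall back to the generic header value if the extension is unknown
    return guessed or header_type


def looks_like_zip(first_bytes):
    # docx and epub files are ZIP containers; a ZIP local file header
    # starts with the bytes "PK" followed by 0x03 0x04.
    return first_bytes[:4] == b"PK\x03\x04"
```

With the urllib3 code from the question, this could be called as resolve_mime_type(url, response.headers.get("content-type")). The magic-number check only tells you the body is a ZIP container rather than plain text; it cannot distinguish docx from epub on its own, so it is best used as a sanity check alongside the extension.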