Python - getting image name and extension from a URL that does not end with a filename extension


Basically, my goal is to fetch the filename, extension and content of an image from its URL. My function should work for both of these URLs:

easy case: https://image.shutterstock.com/image-photo/bright-spring-view-cameo-island-260nw-1048185397.jpg

hard case (does not end with filename.extension): https://images.unsplash.com/photo-1472214103451-9374bd1c798e?ixlib=rb-1.2.1&ixid=eyJhcHBfaWQiOjEyMDd9&w=1000&q=80

Currently, what I have looks like this:

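import urllib.request
from os.path import splitext, basename
from urllib.parse import urlparse

def get_filename_from_url(url):
    # urlretrieve returns a (local_path, headers) tuple, not a parsed URL
    local_path, headers = urllib.request.urlretrieve(url)
    # Take the name and extension from the path component of the URL itself
    filename, file_ext = splitext(basename(urlparse(url).path))
    print(filename, file_ext)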

This works fine for the easy case, but it offers no solution for the hard-case URL. I have a feeling that I can use Python's requests module, parse the response headers to find the MIME type, and then use a guess_type-style function (such as those in the standard mimetypes module) to extract the necessary data. So I went on to try this:

import requests

response = requests.get(url, stream=True)

Someone elsewhere seems to describe this approach, but the problem is that with the hard-case URL I get something strange in the response's dict items. Maybe my key issue is that I don't know the correct way to parse the response headers to extract what I need.

I've tried a third approach using urlparse:

import os
from urllib.parse import urlparse
result = urlparse(url)
print(os.path.basename(result.path)) # 'photo-1472214103451-9374bd1c798e'

which yields the filename, but again, the extension is missing.

The ideal solution would be to get the filename, file extension and file content in one go, preferably while also validating that the URL actually points to an image and not something else.

UPD:

The result[1] element in result = urllib.request.urlretrieve(self.url) seems to contain the Content-Type, but I can't figure out how to extract it correctly.


Solution

  • One way is to query the content type:

    >>> from urllib.request import urlopen
    >>> response = urlopen(url)
    >>> response.info().get_content_type()
    'image/jpeg'
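
The content type alone does not give a filename or an extension. Below is a minimal sketch of how the two pieces might be combined, assuming the standard mimetypes module is acceptable for guessing the extension. split_image_url is a hypothetical helper name, not part of any library, and the check on the "image/" prefix only trusts what the server reports:

```python
import mimetypes
import os
from urllib.parse import urlparse


def split_image_url(url, content_type):
    """Return (filename, extension) for an image URL.

    content_type is the MIME type reported by the server, e.g. 'image/jpeg'.
    split_image_url is a hypothetical helper used for illustration only.
    """
    # Reject anything the server does not label as an image
    if not content_type.startswith("image/"):
        raise ValueError(f"not an image: {content_type}")
    # The last path component holds the name; the query string is ignored
    name, ext = os.path.splitext(os.path.basename(urlparse(url).path))
    if not ext:
        # No extension in the URL: fall back to one guessed from the MIME type
        ext = mimetypes.guess_extension(content_type) or ""
    return name, ext
```

For the hard-case URL with content type 'image/jpeg', this returns the name 'photo-1472214103451-9374bd1c798e' plus whatever extension mimetypes maps to image/jpeg on the running Python version. The content type passed in can come from urlopen as shown above.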
    

    The same content type is available from urlretrieve, as in your edit:

    >>> import urllib.request
    >>> response = urllib.request.urlretrieve(url)
    >>> response[1].get_content_type()