
How to extract file extension from direct and indirect URLs in Python?


I need to extract the file extension from three types of direct and indirect URLs:

"https://needmode.com/products/350e0f54c3480dc035d6db5e7ef898711d5f4ebc_1683455668.jpg"

"https://dkstatics-public.digikala.com/digikala-products/350e0f54c3480dc035d6db5e7ef898711d5f4ebc_1683455668.jpg?x-oss-process=image/resize,m_lfit,h_800,w_800/quality,q_90"

"https://meghdadit.com/_image.ashx?i=%252ffiles%252fproduct%252f4778c8kbqjb7k18sqydnkztp4yzi0jlaug5j5jtybsmuw0lzq2%255blarge%255d.jpg"

My goal is to return "jpg" as the file extension for all of these kinds of URLs.

My Python code:

from urllib.parse import urlparse
import os
img = "IMAGE URL"
parsed_url = urlparse(img)
filename_and_extension = parsed_url.path.rsplit("/", maxsplit=1)[-1]
file_extension = parsed_url.path.rsplit(".", maxsplit=1)[-1].lower()
print("first method: "+file_extension)
filename, file_extension = os.path.splitext(img)
print("second method: "+file_extension)

The first method does not work on the third URL, and the second method does not work on the second URL.
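
To make the two failures concrete, here is a minimal reproduction against the three URLs above. The helper names `first_method` and `second_method` are illustrative only; they wrap the two approaches from the code above.

```python
import os
from urllib.parse import urlparse

URLS = [
    "https://needmode.com/products/350e0f54c3480dc035d6db5e7ef898711d5f4ebc_1683455668.jpg",
    "https://dkstatics-public.digikala.com/digikala-products/350e0f54c3480dc035d6db5e7ef898711d5f4ebc_1683455668.jpg?x-oss-process=image/resize,m_lfit,h_800,w_800/quality,q_90",
    "https://meghdadit.com/_image.ashx?i=%252ffiles%252fproduct%252f4778c8kbqjb7k18sqydnkztp4yzi0jlaug5j5jtybsmuw0lzq2%255blarge%255d.jpg",
]

def first_method(url):
    # Text after the last "." in the path component only; the query string is ignored.
    return urlparse(url).path.rsplit(".", maxsplit=1)[-1].lower()

def second_method(url):
    # os.path.splitext treats the whole URL, query string included, as a file path.
    return os.path.splitext(url)[1]

for url in URLS:
    print(repr(first_method(url)), repr(second_method(url)))
# 'jpg' '.jpg'
# 'jpg' ''
# 'ashx' '.jpg'
```

In the third URL the real file name sits inside the query string, so the path only contains "_image.ashx". In the second URL the query string contains a "/" after the last ".", so os.path.splitext finds no extension at all.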

Is there a way to prioritize the first method to select the extension from the right side of URLs?


Solution

  • If you have a list of valid extensions, it may be easier to search for those than to parse the entire URL. Wikipedia has a very extensive list of known extensions here:

    https://en.wikipedia.org/wiki/List_of_filename_extensions

    If you use the tables from Wikipedia to generate your own list of valid extensions, several approaches become possible, because you effectively know every possible answer in advance. I'm a fan of regex for something like this (assuming you have a long list of potential extensions), but not every solution needs regex. Suppose you have a .txt file containing all the extensions you are looking for, one per line:

    import re

    # Read one extension per line; strip whitespace and skip blank lines
    with open("path/to/.txt") as f:
        extList = [line.strip() for line in f if line.strip()]

    # Try longer extensions first so that e.g. "doc" cannot shadow "docx"
    extList.sort(key=len, reverse=True)

    # Pipe (|) represents OR in regex; re.escape guards against regex metacharacters
    regString = "|".join(re.escape(ext) for ext in extList)

    # Use list of extensions, preceded by '.', to find potential matches. (?i) enables IGNORE_CASE
    regExtensions = re.compile(rf"(?i)\.(?:{regString})") # "(?i)\.(?:PNG|JPG|DOC.....)"
    

    This matches the extensions mentioned above. An example is shown here: https://regex101.com/r/drTCEY/1

    You could then extract the first matching extension from a given string url as follows (note that this raises an IndexError when nothing matches):

    ext = regExtensions.findall(url)[0]

    However, this regex could be improved to exclude false positives (for example, a list that contains "com" would also match the ".com" in a domain name), but how to do that depends on how consistent the URLs are. In the provided examples, the extension is followed by either the end of the string or a "?". This could be added to the regex as: regExtensions = re.compile(rf"(?i)\.(?:{regString})(?=$|\n|\?)"), but again this depends on the URL patterns you are working with.

    Lastly, if you want to exclude the . from the match, put that character in a lookbehind instead of making it part of the match:

    regExtensions = re.compile(rf"(?i)(?<=\.)(?:{regString})")
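
Putting the pieces above together, a minimal sketch might look like the following. It assumes a short hand-written list in place of the extensions file, and the names `EXTENSIONS`, `EXT_RE` and `find_extension` are made up for illustration. It takes the last match rather than the first, on the assumption that the file name sits towards the right-hand side of the URL, and it returns None instead of raising when nothing matches.

```python
import re

# Assumption: a short hand-written list stands in for the extensions file.
EXTENSIONS = ["jpg", "jpeg", "png", "gif", "webp"]

# Longest first, so a longer extension is tried before any shorter one it starts with.
# re.escape guards against regex metacharacters in the list.
alternatives = "|".join(
    re.escape(ext) for ext in sorted(EXTENSIONS, key=len, reverse=True)
)

# The lookbehind keeps the "." out of the match; the lookahead requires
# the extension to be followed by the end of the string or a "?".
EXT_RE = re.compile(rf"(?i)(?<=\.)(?:{alternatives})(?=$|\?)")

def find_extension(url):
    """Return the last listed extension found in url, lowercased, or None."""
    matches = EXT_RE.findall(url)
    return matches[-1].lower() if matches else None

URLS = [
    "https://needmode.com/products/350e0f54c3480dc035d6db5e7ef898711d5f4ebc_1683455668.jpg",
    "https://dkstatics-public.digikala.com/digikala-products/350e0f54c3480dc035d6db5e7ef898711d5f4ebc_1683455668.jpg?x-oss-process=image/resize,m_lfit,h_800,w_800/quality,q_90",
    "https://meghdadit.com/_image.ashx?i=%252ffiles%252fproduct%252f4778c8kbqjb7k18sqydnkztp4yzi0jlaug5j5jtybsmuw0lzq2%255blarge%255d.jpg",
]

for url in URLS:
    print(find_extension(url))  # prints "jpg" for each of the three URLs
```

With this short list, the ".com" in the domains and the ".ashx" in the third URL are ignored simply because they are not listed. With a longer list that includes "ashx", the third URL would produce two matches, which is why the sketch takes the last one.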