Why does urllib.parse not split the URL <scheme>:<number> correctly in all cases?


If I input a URL of the form <scheme>:<integer>, urllib.parse fails to split off the scheme for some schemes but not for others. If I alter <integer> by adding a non-digit character, it works as expected. (I'm on Python 3.8.8.)

>>> from urllib.parse import urlparse
>>> urlparse("custom:12345")  # does not work
ParseResult(scheme='', netloc='', path='custom:12345', params='', query='', fragment='')
>>> urlparse("zip:12345")  # does not work
ParseResult(scheme='', netloc='', path='zip:12345', params='', query='', fragment='')
>>> urlparse("custom:12345d")  # this works as expected
ParseResult(scheme='custom', netloc='', path='12345d', params='', query='', fragment='')
>>> urlparse("custom:12345.")  # so does this
ParseResult(scheme='custom', netloc='', path='12345.', params='', query='', fragment='')
>>> urlparse("http:12345")  # for some reason this works (!?)
ParseResult(scheme='http', netloc='', path='12345', params='', query='', fragment='')
>>> urlparse("https:12345") # yet this does not
ParseResult(scheme='', netloc='', path='https:12345', params='', query='', fragment='')
>>> urlparse("ftp:12345")  # no luck here either
ParseResult(scheme='', netloc='', path='ftp:12345', params='', query='', fragment='')

According to Wikipedia, a URI requires a scheme. Empty schemes should correspond to URI references, which should only treat <scheme>:<number> as a scheme-less (relative) path containing a colon if it is preceded by ./.

So why does this break in the way demonstrated above? I would have expected every case above to be split into <scheme> and <number>, with <number> as the path.


Solution

  • You're seeing different results when the path contains non-digit characters because of this section of the urllib.parse source:

    # make sure "url" is not actually a port number (in which case
    # "scheme" is really part of the path)
    rest = url[i+1:]
    if not rest or any(c not in '0123456789' for c in rest):
        # not a port number
        scheme, url = url[:i].lower(), rest
    

    In Python 3.8, if the input has the form "<stuff>:<numbers>", the numbers are assumed to be a port, in which case the stuff isn't treated as a scheme and it all ends up in the path.

    This was reported as a bug and (after quite a lot of back and forth!) fixed in Python 3.9; the above was rewritten to simply:

    scheme, url = url[:i].lower(), url[i+1:]
    

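    (and some special casing for url[:i] == 'http' was removed; that branch returned before the port check ran, which is why http:12345 was split in 3.8 while https:12345 was not).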
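
    If the code has to behave the same on 3.8 and on 3.9+, one option is to split the scheme off yourself before handing the rest to urllib.parse. The following is only a minimal sketch under that assumption: split_scheme and split_scheme_38_style are hypothetical helper names, not part of urllib, and the second one mimics only the port check quoted above (it ignores the 'http' special case and the scheme-character validation).

```python
import re

# RFC 3986 grammar: scheme = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." )
_SCHEME_RE = re.compile(r"^([A-Za-z][A-Za-z0-9+.-]*):(.*)$", re.DOTALL)


def split_scheme(url):
    """Hypothetical helper: return (scheme, rest) without any port heuristic."""
    match = _SCHEME_RE.match(url)
    if match is None:
        # No syntactically valid scheme prefix: treat everything as the path.
        return "", url
    return match.group(1).lower(), match.group(2)


def split_scheme_38_style(url):
    """Hypothetical helper: simplified imitation of the 3.8 port check only."""
    i = url.find(":")
    if i <= 0:
        return "", url
    rest = url[i + 1:]
    if not rest or any(c not in "0123456789" for c in rest):
        # Not a port number, so the prefix is accepted as a scheme.
        return url[:i].lower(), rest
    # All digits after the colon: assumed to be a port, scheme stays empty.
    return "", url


if __name__ == "__main__":
    for candidate in ("custom:12345", "custom:12345d", "./custom:12345"):
        print(candidate, split_scheme(candidate), split_scheme_38_style(candidate))
```

    With split_scheme, an all-digit remainder no longer changes the result, and a leading ./ still marks the input as a scheme-less relative path, which matches the expectation stated in the question.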