If I input a URL of the form <scheme>:<integer>, then, depending on the scheme being used, neither function splits off the scheme correctly. If I alter <integer> by adding a non-digit character, this works as expected. (I'm on Python 3.8.8.)
>>> from urllib.parse import urlparse
>>> urlparse("custom:12345") # does not work
ParseResult(scheme='', netloc='', path='custom:12345', params='', query='', fragment='')
>>> urlparse("zip:12345") # does not work
ParseResult(scheme='', netloc='', path='zip:12345', params='', query='', fragment='')
>>> urlparse("custom:12345d") # this works as expected
ParseResult(scheme='custom', netloc='', path='12345d', params='', query='', fragment='')
>>> urlparse("custom:12345.") # so does this
ParseResult(scheme='custom', netloc='', path='12345.', params='', query='', fragment='')
>>> urlparse("http:12345") # for some reason this works (!?)
ParseResult(scheme='http', netloc='', path='12345', params='', query='', fragment='')
>>> urlparse("https:12345") # yet this does not
ParseResult(scheme='', netloc='', path='https:12345', params='', query='', fragment='')
>>> urlparse("ftp:12345") # no luck here either
ParseResult(scheme='', netloc='', path='ftp:12345', params='', query='', fragment='')
According to Wikipedia, a URI requires a scheme. Empty schemes should correspond to URI references, which should only treat <scheme>:<number> as a scheme-less (relative) path containing a colon if it is preceded by ./.
So why does this break in the way demonstrated above? I would have expected all the cases above to split the URI/URL into <scheme>:<number>, where <number> is the path.
You're seeing different results if there are non-numerical characters in the path because of this section:
# make sure "url" is not actually a port number (in which case
# "scheme" is really part of the path)
rest = url[i+1:]
if not rest or any(c not in '0123456789' for c in rest):
    # not a port number
    scheme, url = url[:i].lower(), rest
In Python 3.8, if the input has the form "<stuff>:<numbers>", the numbers are assumed to be a port, in which case the stuff isn't treated as a scheme and it all ends up in the path.
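To make the heuristic easier to follow, here is a minimal sketch of it as a standalone function. The name split_scheme_py38 is hypothetical and not part of urllib; the "http" branch is a simplified stand-in for the old special casing, chosen only because it matches the output shown in the question.

```python
import string

# Characters allowed in a scheme name: letters, digits, "+", "-", ".".
SCHEME_CHARS = string.ascii_letters + string.digits + "+-."


def split_scheme_py38(url):
    """Hypothetical helper that mimics the Python 3.8 scheme heuristic.

    Returns a (scheme, rest) tuple; scheme is "" when none is split off.
    """
    i = url.find(":")
    if i > 0:
        if url[:i] == "http":
            # Simplified model of the old "http" fast path, which
            # skipped the port-number check entirely.
            return "http", url[i + 1:]
        if all(c in SCHEME_CHARS for c in url[:i]):
            rest = url[i + 1:]
            # An all-digit remainder is assumed to be a port number,
            # so the would-be scheme is left in the path.
            if not rest or any(c not in "0123456789" for c in rest):
                return url[:i].lower(), rest
    return "", url


for candidate in ("custom:12345", "custom:12345d", "http:12345", "https:12345"):
    print(candidate, "->", split_scheme_py38(candidate))
```

Under these assumptions, "custom:12345" and "https:12345" keep an empty scheme, while "custom:12345d" and "http:12345" are split, which lines up with the results in the question.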
This was reported as a bug and (after quite a lot of back and forth!) fixed in Python 3.9; the above was rewritten to simply:
scheme, url = url[:i].lower(), url[i+1:]
(and some special casing for url[:i] == 'http' was removed).
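For comparison, a rough sketch of the simplified rule: once the text before the first colon consists only of valid scheme characters, it is taken as the scheme regardless of what follows. The helper name split_scheme_py39 is hypothetical, and this models only the scheme-splitting step, not the rest of urlsplit.

```python
import string

# Characters allowed in a scheme name: letters, digits, "+", "-", ".".
SCHEME_CHARS = string.ascii_letters + string.digits + "+-."


def split_scheme_py39(url):
    """Hypothetical helper that mimics the simplified scheme split.

    Returns a (scheme, rest) tuple; scheme is "" when none is split off.
    """
    i = url.find(":")
    if i > 0 and all(c in SCHEME_CHARS for c in url[:i]):
        # No port-number guess and no "http" special case any more.
        return url[:i].lower(), url[i + 1:]
    return "", url


for candidate in ("custom:12345", "zip:12345", "https:12345", "ftp:12345"):
    print(candidate, "->", split_scheme_py39(candidate))
```

With this rule every example from the question splits into a scheme and a numeric path, which is the behaviour the question expected.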