python, url, urllib

Why does Python's urljoin not work as expected?


With Python 3.8 I want to join two parts of a URL into one. Here is an example:

import urllib.parse

domain = "https://some.domain.ch/myportal#/"
urllib.parse.urljoin(domain, "test1")

This gives the output

'https://some.domain.ch/test1'

but I expect the output

'https://some.domain.ch/myportal#/test1'

Asking just to understand.

As a workaround I will use

domain + "test1"

Solution

  • >>> urllib.parse.urlparse(domain)
    ParseResult(scheme='https', netloc='some.domain.ch', path='/myportal', params='', query='', fragment='/')
    

    The problem is the # in your URL. Per RFC 3986, which urllib.parse follows, a # ends the path and starts the fragment, so your path is only /myportal and the trailing / is the fragment.

    See §3 for a diagram of the parts of a URL:

             foo://example.com:8042/over/there?name=ferret#nose
             \_/   \______________/\_________/ \_________/ \__/
              |           |            |            |        |
           scheme     authority       path        query   fragment
    

    The path is defined in §3.3. Yours is /myportal, which matches the rules

    path-absolute = "/" [ segment-nz *( "/" segment ) ]
    ...
    segment-nz    = 1*pchar
    

    where pchar is defined in Appendix A:

       pchar         = unreserved / pct-encoded / sub-delims / ":" / "@"
    ...
       pct-encoded   = "%" HEXDIG HEXDIG
    
       unreserved    = ALPHA / DIGIT / "-" / "." / "_" / "~"
    ...
       sub-delims    = "!" / "$" / "&" / "'" / "(" / ")"
                     / "*" / "+" / "," / ";" / "="
    

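    The # cannot be a pchar, so the path stops there. urljoin then ignores the fragment of the base URL and replaces the last path segment (myportal) with test1, which is why you get /test1.

    Either remove the # if it is not required (keep the trailing slash, otherwise myportal is still replaced):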

    >>> import urllib.parse
    >>> urllib.parse.urljoin("https://some.domain.ch/myportal/", "test1")
    'https://some.domain.ch/myportal/test1'
    

    Or percent-encode it, so that it becomes part of the path:

    >>> urllib.parse.quote("#")
    '%23'
    >>> urllib.parse.urljoin("https://some.domain.ch/myportal%23/", "test1")
    #                                                        ^^^
    'https://some.domain.ch/myportal%23/test1'
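
If the #/ is intentional, for example because the portal uses the fragment as a client-side route (an assumption, the question does not say), a third option is to append to the fragment instead of the path. The sketch below first reproduces the behaviour from the question, then uses a hypothetical helper, join_fragment_route, which is not part of urllib; it only relies on urllib.parse.urldefrag to split the URL at the #.

```python
from urllib.parse import urldefrag, urljoin, urlsplit

base = "https://some.domain.ch/myportal#/"

# "#" ends the path: everything after it is the fragment.
parts = urlsplit(base)
print(parts.path)      # /myportal
print(parts.fragment)  # /

# urljoin ignores the base fragment and replaces the last path
# segment ("myportal") with the relative reference.
print(urljoin(base, "test1"))  # https://some.domain.ch/test1


def join_fragment_route(base_url: str, route: str) -> str:
    """Hypothetical helper: append `route` to the fragment of `base_url`."""
    url, fragment = urldefrag(base_url)
    # Put exactly one slash between the existing fragment and the new route.
    new_fragment = fragment.rstrip("/") + "/" + route.lstrip("/")
    return url + "#" + new_fragment


print(join_fragment_route(base, "test1"))
# https://some.domain.ch/myportal#/test1
```

For this particular base URL the result is the same as domain + "test1", but the helper also copes with a missing or doubled slash between the fragment and the route.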