python urlparse

How to parse URLs using urlparse and split() in python?


Could someone explain the purpose of the line host = parsed.netloc.split('@')[-1].split(':')[0] in the following code? I understand that we are trying to get the host name from netloc, but I don't understand why we split on the @ delimiter and then again on the : delimiter.

import urlparse
parsed = urlparse.urlparse('https://www.google.co.uk/search?client=ubuntu&channel=fs')
print parsed
host = parsed.netloc.split('@')[-1].split(':')[0]
print host


Result:

ParseResult(scheme='https', netloc='www.google.co.uk', path='/search', params='', query='client=ubuntu&channel=fs', fragment='')

www.google.co.uk

Surely, if one just needs the domain, we can get that from parsed.netloc?


Solution

  • Netloc in its full form can have HTTP authentication credentials and a port number:

    login:password@www.google.co.uk:80
    

    See RFC1808 and RFC1738

    So we potentially have to split that into ["login:password", "www.google.co.uk:80"], take the last part, split that into ["www.google.co.uk", "80"] and take the hostname.

    If these parts are omitted, there's no harm in splitting on nonexistent delimiters: split() then returns a one-element list, so there is no need to check whether they are present.

    urlparse documentation
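
The two-step split described above can be checked with a short sketch. It assumes Python 3, where the Python 2 urlparse module used in the question lives at urllib.parse. The helper name host_from_netloc and the example URLs are made up for illustration.

```python
from urllib.parse import urlparse  # Python 3 location of the Python 2 urlparse module


def host_from_netloc(netloc):
    """Strip optional credentials and port from a netloc string (hypothetical helper)."""
    # Drop "login:password@" if present, then drop ":port" if present.
    return netloc.split('@')[-1].split(':')[0]


# One URL with credentials and a port, one with neither.
full = urlparse('https://login:password@www.google.co.uk:80/search?client=ubuntu')
bare = urlparse('https://www.google.co.uk/search?client=ubuntu')

print(full.netloc)                    # login:password@www.google.co.uk:80
print(host_from_netloc(full.netloc))  # www.google.co.uk
print(host_from_netloc(bare.netloc))  # www.google.co.uk

# The parsed result also exposes these pieces as attributes.
print(full.hostname, full.port, full.username)  # www.google.co.uk 80 login
```

The hostname, port, username and password attributes of the parsed result are usually a safer choice than manual splitting. For example, the split on ':' does not handle a bracketed IPv6 host such as [::1]:8080, whereas hostname does.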