Get only domain name from urls using urlsplit


I have a dataset with URLs in different forms (e.g. https://stackoverflow.com, https://www.stackoverflow.com, stackoverflow.com), and I need only the domain name, such as stackoverflow.

I used parse.urlsplit(url) from urllib, but it does not work well in my case.

How can I get only the domain name?

Edit:

My code:

from urllib import parse

def normalization(df):
    # Store the SplitResult of each URL in a new column
    df['after_urlsplit'] = df["httpx"].map(parse.urlsplit)
    return df

normalization(df_sample)

Output:

            httpx                       after_urlsplit
0   https://stackoverflow.com/       (https, stackoverflow.com, /, , )
1   https://www.stackoverflow.com/   (https, www.stackoverflow.com, /, , )
2   www.stackoverflow.com/           (, , www.stackoverflow.com/, , )
3   stackoverflow.com/               (, , stackoverflow.com/, , )

Solution

  • New answer, working for URLs and host names too

    To handle input with no protocol definition (e.g. example.com), it is better to use a regex:

    import re
    
    urls = ['www.stackoverflow.com',
            'stackoverflow.com',
            'https://stackoverflow.com',
            'https://www.stackoverflow.com/',
            'www.stackoverflow.com',
            'stackoverflow.com',
            'https://subdomain.stackoverflow.com/']
    
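    for url in urls:
        # Skip an optional "scheme://" prefix, capture the host up to the first "/",
        # then take the second-to-last dot-separated label.
        host_name = re.search(r"^(?:\w+://)?([^/]*)", url).group(1).split('.')[-2]
        print(host_name)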
    

    This prints stackoverflow for every entry in the list.
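
    The loop above could be wrapped in a reusable function so it can be applied to a whole column. The following is a minimal sketch, not a definitive implementation: the name domain_from_url and the choice to return None for hosts without a dot are assumptions made here for illustration.

```python
import re

# Optional "scheme://" prefix, then everything up to the first "/" as the host.
_HOST_RE = re.compile(r"^(?:\w+://)?([^/]*)")

def domain_from_url(url):
    """Return the second-to-last dot-separated label of the host, or None."""
    host = _HOST_RE.search(url.strip()).group(1)
    labels = host.split(".")
    # A host without a dot (e.g. "localhost") has no second-to-last label.
    return labels[-2] if len(labels) >= 2 else None

samples = ["https://stackoverflow.com/", "www.stackoverflow.com/", "stackoverflow.com/"]
print([domain_from_url(u) for u in samples])
# prints ['stackoverflow', 'stackoverflow', 'stackoverflow']
```

    With the DataFrame from the question, something like df["httpx"].map(domain_from_url) should produce the new column.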

    Old answer, working for URLs only

    You can use the netloc value returned by urlsplit, with some extra tailoring to extract the part of the domain you want:

    from urllib.parse import urlsplit
    
    m = urlsplit('http://subdomain.example.com/some/extra/things')
    
    print(m.netloc.split('.')[-2])
    

    This prints example.

    (However, this fails on URLs like http://localhost/some/path/to/file.txt: the netloc localhost contains no dot, so split('.')[-2] raises an IndexError.)
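
    The urlsplit approach can also be made to cope with scheme-less input and single-label hosts. urlsplit only fills netloc when the authority part is introduced by "//", which is why www.stackoverflow.com/ ends up in the path in the question's output. The sketch below works around that by prefixing "//" when no scheme is present; the function name domain_via_urlsplit and the decision to return a single-label host unchanged are assumptions for illustration only.

```python
import re
from urllib.parse import urlsplit

# Matches a leading scheme such as "https://" or "ftp://".
_SCHEME_RE = re.compile(r"^[a-zA-Z][a-zA-Z0-9+.-]*://")

def domain_via_urlsplit(url):
    """Sketch: second-to-last host label via urlsplit, or the bare host name."""
    url = url.strip()
    # Without "//", urlsplit treats "stackoverflow.com/" as a path, not a host.
    if not _SCHEME_RE.match(url) and not url.startswith("//"):
        url = "//" + url
    host = urlsplit(url).hostname or ""
    labels = host.split(".")
    # Single-label hosts (e.g. "localhost") are returned unchanged.
    return labels[-2] if len(labels) >= 2 else host

print(domain_via_urlsplit("www.stackoverflow.com/"))                  # stackoverflow
print(domain_via_urlsplit("http://localhost/some/path/to/file.txt"))  # localhost
```

    Like the snippets above, this simply takes the second-to-last label, so a host such as www.example.co.uk would yield co rather than example.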