Get only the domain name from URLs using urlsplit

I have a dataset with URLs in different forms (e.g. https://stackoverflow.com, https://www.stackoverflow.com, stackoverflow.com) and I need to extract only the domain name, like stackoverflow.

I used parse.urlsplit(url) from urllib, but it does not work well in my case: for URLs without a scheme, the host name ends up in the path component instead of the netloc.

How can I get only the domain name?

Edit:

My code:

from urllib import parse

def normalization(df):
    # Split every URL in the "httpx" column into its components
    df['after_urlsplit'] = df["httpx"].map(parse.urlsplit)
    return df

normalization(df_sample)

output:

            httpx                       after_urlsplit
0   https://stackoverflow.com/       (https, stackoverflow.com, /, , )
1   https://www.stackoverflow.com/   (https, www.stackoverflow.com, /, , )
2   www.stackoverflow.com/           (, , www.stackoverflow.com/, , )
3   stackoverflow.com/               (, , stackoverflow.com/, , )

Solution

New answer, working for urls and host names too

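urlsplit only recognizes a netloc when it is introduced by //, so for inputs without a protocol definition (e.g. example.com) the host name lands in the path. To handle those inputs as well, you can strip the optional protocol with a regex: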

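import re

urls = ['www.stackoverflow.com',
        'stackoverflow.com',
        'https://stackoverflow.com',
        'https://www.stackoverflow.com/',
        'www.stackoverflow.com/',
        'stackoverflow.com/',
        'https://subdomain.stackoverflow.com/']

for url in urls:
    # Skip an optional "scheme://" prefix, then capture the host up to the
    # first "/", "?" or "#" so that dots in the path are ignored.
    host = re.search(r"^(?:\w+://)?([^/?#]*)", url).group(1)
    # The second-to-last dot-separated label is the domain name.
    host_name = host.split('.')[-2]
    print(host_name)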

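This prints stackoverflow in all cases. Note that taking the second-to-last label is a heuristic: it returns co for a host like example.co.uk, and it raises an IndexError for a host without a dot, such as localhost.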
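To apply this to a whole column, it may help to wrap the regex approach in a function. The following is only a sketch: the name domain_label and the choice to return None for hosts without a dot are assumptions made for illustration, not part of the original answer.

```python
import re

# Optional "scheme://" prefix, then the host up to the first "/", "?", "#" or ":".
_HOST_RE = re.compile(r"^(?:\w+://)?([^/?#:]*)")

def domain_label(url):
    """Return the second-to-last dot-separated label of the host, or None.

    Hypothetical helper: same heuristic as the loop above, but it does not
    raise for single-label hosts such as "localhost".
    """
    host = _HOST_RE.match(url.strip()).group(1)
    labels = host.split(".")
    # A host without a dot has only one label, so there is no [-2] element.
    return labels[-2] if len(labels) >= 2 else None

if __name__ == "__main__":
    for u in ["https://www.stackoverflow.com/", "stackoverflow.com/", "http://localhost/x"]:
        print(u, "->", domain_label(u))
```

With the DataFrame from the question, this could presumably be used as df["httpx"].map(domain_label).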

Old answer, working for urls only

You can take the netloc value returned by urlsplit and split it on dots to pick out the part of the domain you want:

from urllib.parse import urlsplit

m = urlsplit('http://subdomain.example.com/some/extra/things')

print(m.netloc.split('.')[-2])

This prints example.

(However, this fails on URLs like http://localhost/some/path/to/file.txt: the host localhost contains no dot, so indexing with [-2] raises an IndexError.)
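If you would rather stay with urlsplit, one possible workaround is to prepend // to inputs that lack it, so that the host is parsed into the netloc. The sketch below is a hedged illustration: the function name domain_from_url and the decision to return a single-label host such as localhost unchanged are assumptions, not something the original answer prescribes.

```python
import re
from urllib.parse import urlsplit

def domain_from_url(url):
    """Hypothetical helper: extract the domain label using urlsplit."""
    url = url.strip()
    # urlsplit only fills netloc when the string starts with "//" or
    # "scheme://", so add "//" in front of bare host names.
    if not re.match(r"^(\w+:)?//", url):
        url = "//" + url
    # .hostname is the lower-cased host without any port; None if absent.
    host = urlsplit(url).hostname or ""
    labels = host.split(".")
    if len(labels) >= 2:
        return labels[-2]
    # Single-label host (e.g. "localhost") or empty input.
    return host or None

if __name__ == "__main__":
    for u in ["https://stackoverflow.com/", "www.stackoverflow.com/",
              "http://localhost/some/path/to/file.txt"]:
        print(u, "->", domain_from_url(u))
```

Using .hostname instead of .netloc also drops a port or user info if one is present, which the plain netloc.split('.') approach would leave attached.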