I have a dataset with URLs in different forms (e.g. https://stackoverflow.com, https://www.stackoverflow.com, stackoverflow.com) and I need only the domain name, like stackoverflow.
I used parse.urlsplit(url) from urllib, but it's not working well in my case.
How can I get only the domain name?
Edit:
My code:
from urllib import parse

def normalization(df):
    # Split every URL in the "httpx" column into its components
    df['after_urlsplit'] = df["httpx"].map(lambda x: parse.urlsplit(x))
    return df

normalization(df_sample)
Output:
   httpx                           after_urlsplit
0  https://stackoverflow.com/      (https, stackoverflow.com, /, , )
1  https://www.stackoverflow.com/  (https, www.stackoverflow.com, /, , )
2  www.stackoverflow.com/          (, , www.stackoverflow.com/, , )
3  stackoverflow.com/              (, , stackoverflow.com/, , )
To handle instances where there is no protocol definition (e.g. example.com), it is better to use a regex:
import re

urls = ['www.stackoverflow.com',
        'stackoverflow.com',
        'https://stackoverflow.com',
        'https://www.stackoverflow.com/',
        'www.stackoverflow.com',
        'stackoverflow.com',
        'https://subdomain.stackoverflow.com/']

for url in urls:
    # Drop an optional "scheme://" prefix, then take the second-to-last dot-separated piece
    host_name = re.search(r"^(?:.*://)?(.*)$", url).group(1).split('.')[-2]
    print(host_name)
This prints stackoverflow in all cases.
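One caveat: the split above runs over everything after the scheme, so a path that contains a dot (e.g. stackoverflow.com/a/file.txt) would return the wrong piece. A minimal sketch that cuts the string down to the host before splitting is shown below; the helper name host_label is made up for illustration, and it still assumes a single-label suffix such as .com.

```python
import re

def host_label(url):
    # Hypothetical helper: remove an optional "scheme://" prefix
    host = re.sub(r"^[a-zA-Z][a-zA-Z0-9+.-]*://", "", url)
    # Keep only the part before the first '/', '?' or '#', i.e. the host
    host = re.split(r"[/?#]", host, maxsplit=1)[0]
    parts = host.split(".")
    # Fall back to the whole host when it contains no dot
    return parts[-2] if len(parts) >= 2 else host

for url in ["stackoverflow.com/a/file.txt", "https://www.stackoverflow.com/"]:
    print(host_label(url))
```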
You can use the value of netloc returned by urlsplit, with some extra tailoring to get the part of the domain you want:
from urllib.parse import urlsplit
m = urlsplit('http://subdomain.example.com/some/extra/things')
print(m.netloc.split('.')[-2])
This prints example.
(However, this would fail on URLs like http://localhost/some/path/to/file.txt, because the netloc contains no dot and the [-2] index raises an IndexError.)
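For the scheme-less rows in the question (www.stackoverflow.com/ and stackoverflow.com/), urlsplit leaves netloc empty because it only recognises a host after "//". A rough sketch that works around both that and the localhost case might look like this; domain_label is a hypothetical name, and the function still assumes the domain sits second from the right, so it would return co for example.co.uk.

```python
from urllib.parse import urlsplit

def domain_label(url):
    # Hypothetical helper: urlsplit only fills netloc after '//',
    # so prepend it when the input has no scheme
    if "://" not in url and not url.startswith("//"):
        url = "//" + url
    # .hostname drops any port or user info and lower-cases the host
    host = urlsplit(url).hostname or ""
    parts = host.split(".")
    # Single-label hosts such as "localhost" are returned unchanged
    return parts[-2] if len(parts) >= 2 else host

for url in ["https://stackoverflow.com/", "www.stackoverflow.com/",
            "stackoverflow.com/", "http://localhost/some/path/to/file.txt"]:
    print(domain_label(url))
```

With the DataFrame from the question, this could presumably be applied as df["httpx"].map(domain_label).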