
How to normalize a URL and disregard anything after the slash?


I have hundreds of URLs that I want to normalise to a bare domain format (domain.com, domain.ie, domain.de, domain.es, etc.). However, I'm struggling to cover cases where there is text after the '/' symbol.

I assume I need to add another if condition, find the first slash (/) in my URL string, and then split with something similar to u.rsplit('/', 1)[-1]?

My code so far:

from w3lib.url import url_query_cleaner
from url_normalize import url_normalize

urls = ['foo.com','www.foo.com/','foo.com/us','foo.com/ca/example-test/']


def canonical_url(u):
    u = url_normalize(u)
    u = url_query_cleaner(u,parameterlist = ['utm_source','utm_medium','utm_campaign','utm_term','utm_content'],remove=True)
    if u.startswith("http://"):
        u = u[7:]
    if u.startswith("https://"):
        u = u[8:]
    if u.startswith("www."):
        u = u[4:]
    if u.endswith("/"):
        u = u[:-1]
    return u

list(map(canonical_url,urls))

Currently this returns:

['foo.com', 'foo.com', 'foo.com/us', 'foo.com/ca/example-test']

Expected outcome:

['foo.com', 'foo.com', 'foo.com', 'foo.com']

Could someone help me with this, please? Thank you in advance.


Solution

  • You can use parse_url from the urllib3 package (a third-party library, not the standard-library urllib):

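    from urllib3.util import parse_url

    urls = ['foo.com','www.foo.com/','foo.com/us','foo.com/ca/example-test/']
    hosts = []
    for url in urls:
        # parse_url treats scheme-less input as a host, so 'foo.com/us' gives host 'foo.com'
        host = parse_url(url).host
        # Slice the prefix off; str.lstrip('www.') strips characters, not a prefix
        if host.startswith('www.'):
            host = host[4:]
        hosts.append(host)
    print(hosts)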
    

    Each URL then resolves to foo.com, as you expected.
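If you would rather avoid the extra dependency, the same idea can probably be expressed with the standard library alone. The following is only a sketch under a few assumptions: bare_domain is a hypothetical helper name (not from the question or from any library), and inputs without a scheme are retried with a leading '//' so that urlsplit recognises the host part.

```python
from urllib.parse import urlsplit


def bare_domain(url):
    """Return the host of `url` without scheme, 'www.' prefix, port, or path."""
    parts = urlsplit(url)
    # urlsplit only recognises a host after '//'. For scheme-less input such as
    # 'foo.com/us' the netloc is empty, so parse again with '//' prepended.
    if not parts.netloc:
        parts = urlsplit('//' + url)
    host = parts.hostname or ''
    # Remove a leading 'www.' by slicing, not by lstrip (which strips characters).
    if host.startswith('www.'):
        host = host[4:]
    return host


urls = ['foo.com', 'www.foo.com/', 'foo.com/us', 'foo.com/ca/example-test/']
print([bare_domain(u) for u in urls])
# ['foo.com', 'foo.com', 'foo.com', 'foo.com']
```

Because only the host is kept, the query string (including any utm_* parameters) and the path are discarded as a side effect, so the url_query_cleaner step from the question should not be needed for this particular output.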