Strip URL - Python


OK, how do I use a regex to remove http:// and/or www. so that http://www.domain.com/ becomes domain.com?

Assume x is any kind of TLD or ccTLD.

Input example:

http://www.domain.x/

www.domain.x

Output:

domain.x


Solution

  • If you really want to use regular expressions instead of urlparse() or splitting the string:

    >>> import re
    >>> domain = 'http://www.example.com/'
    >>> re.match(r'(?:\w*://)?(?:.*\.)?([a-zA-Z0-9-]*\.[a-zA-Z]{1,}).*', domain).groups()[0]
    'example.com'
    

    The regular expression might be a bit simplistic, but it works. It also extracts the domain rather than replacing the unwanted parts, which I think is easier.
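Since the question asks for removal rather than extraction, a replacing variant could look like the minimal sketch below. The helper name `strip_url` is hypothetical, and the sketch assumes that only the scheme and a single leading `www.` should be removed; other subdomains and any port number are left in place.

```python
import re

# Optional scheme (e.g. "http://") followed by an optional "www." prefix.
# The leading ^ means the pattern can only match at the start of the string.
_PREFIX = re.compile(r'^(?:\w+://)?(?:www\.)?', re.IGNORECASE)


def strip_url(url):
    """Drop the scheme and a leading 'www.', then cut off path, query and fragment."""
    host = _PREFIX.sub('', url.strip())
    # Everything from the first '/', '?' or '#' onwards is not part of the host.
    return re.split(r'[/?#]', host, maxsplit=1)[0]


print(strip_url('http://www.domain.x/'))  # domain.x
print(strip_url('www.domain.x'))          # domain.x
```

Unlike the extracting pattern, this keeps whatever follows `www.`, so `http://blog.domain.x/` would come back as `blog.domain.x`.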

    To support domains like 'co.uk', one can do the following:

    >>> domain = 'http://www.google.co.uk/'
    >>> p = re.compile(r'(?:\w*://)?(?:.*?\.)?(?:([a-zA-Z0-9-]*)\.)?([a-zA-Z0-9-]*\.[a-zA-Z]{1,}).*')
    >>> p.match(domain).groups()
    ('google', 'co.uk')

    So you have to check the result for suffixes like 'co.uk' and join the two groups again in that case. Normal domains should work OK. I could not make it work for URLs with multiple subdomains.
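The check-and-join step described above might be sketched as follows. The set `TWO_PART_SUFFIXES` and the helper name `registered_domain` are hypothetical, and the suffix set is a deliberately incomplete sample; the limitation with multiple subdomains still applies.

```python
import re

# The two-group pattern from the answer, written with 0-9 in the character classes.
PATTERN = re.compile(
    r'(?:\w*://)?(?:.*?\.)?(?:([a-zA-Z0-9-]*)\.)?([a-zA-Z0-9-]*\.[a-zA-Z]{1,}).*'
)

# Hypothetical, incomplete sample of two-part suffixes (an assumption for this sketch).
TWO_PART_SUFFIXES = {'co.uk', 'org.uk', 'com.au', 'co.jp'}


def registered_domain(url):
    """Return the domain, re-joining the two groups when the tail is a known suffix."""
    match = PATTERN.match(url)
    if match is None:
        return None
    label, tail = match.groups()
    # For 'http://www.google.co.uk/' the groups are ('google', 'co.uk').
    if label and tail.lower() in TWO_PART_SUFFIXES:
        return label + '.' + tail
    return tail


print(registered_domain('http://www.google.co.uk/'))  # google.co.uk
print(registered_domain('http://www.example.com/'))   # example.com
```

A hard-coded set like this will miss most real suffixes, so it only illustrates the joining logic.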

    One-liner without regular expressions or fancy modules:

    >>> domain = 'http://www.example.com/'
    >>> '.'.join(domain.replace('http://','').split('/')[0].split('.')[-2:])
    'example.com'
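
For comparison, the urlparse() route mentioned at the start of the answer could be sketched like this. It assumes Python 3, where the function lives in `urllib.parse`; the helper name `hostname_without_www` is hypothetical, and inputs without a scheme are assumed to begin with the host name.

```python
from urllib.parse import urlparse


def hostname_without_www(url):
    """Return the host part of a URL with a leading 'www.' removed."""
    # urlparse only recognises the host when the input has a scheme or
    # starts with '//', so prefix bare inputs such as 'www.domain.x'.
    if '://' not in url:
        url = '//' + url
    host = urlparse(url).hostname or ''
    if host.startswith('www.'):
        host = host[len('www.'):]
    return host


print(hostname_without_www('http://www.domain.x/'))  # domain.x
print(hostname_without_www('www.domain.x'))          # domain.x
```

This handles other schemes such as https:// as well, which the one-liner does not, but like the replacing approach it keeps subdomains other than www.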