I am trying to create a regex filter to sanitize domains processed by a Python script. The domains could be plain domain names, could have a URL structure, or could have a URL structure with a www prefix. I currently have a crude regex that pulls domains out of these structures, but I have not figured out a way to filter out the leading www.
(?:[a-zA-Z0-9](?:[a-zA-Z0-9\-@]{,61}[a-zA-Z0-9])?\.)+[a-zA-Z]{2,11}
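For context, here's roughly how I'm applying it (simplified, illustrative inputs):

>>> import re
>>> domain_re = re.compile(r'(?:[a-zA-Z0-9](?:[a-zA-Z0-9\-@]{,61}[a-zA-Z0-9])?\.)+[a-zA-Z]{2,11}')
>>> domain_re.search('example.com').group(0)
'example.com'
>>> domain_re.search('http://example.com/some/path').group(0)
'example.com'
>>> domain_re.search('http://www.example.com/some/path').group(0)
'www.example.com'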
This regex does a decent job of grabbing domains out of URLs, but when I try to add any kind of negative lookahead to exclude the www., I can't seem to get the desired result. I've tried (?!www.), which only took away one w, not all three plus the dot. Any help figuring this out would be most appreciated.
Unless you absolutely have to use a regex, it's better to use something designed for this, like the built-in urlparse. For one thing, your regex (and the one linked in the comments) won't match domains with non-ASCII characters.
>>> from urlparse import urlparse # Python 2
>>> # from urllib.parse import urlparse # Python 3
>>> urlparse('http://www.some.domain/the/path')
ParseResult(scheme='http', netloc='www.some.domain', path='/the/path', params='', query='', fragment='')
>>> urlparse('http://www.some.domain/the/path').netloc
'www.some.domain'
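As a quick illustration of the non-ASCII point (made-up internationalized domain, using the Python 3 import):

>>> from urllib.parse import urlparse  # Python 3
>>> urlparse('http://www.bücher.example/path').netloc
'www.bücher.example'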
Note that you might want to detect strings without a scheme and add one:
>>> url = 'www.other.domain'
>>> urlparse(url)
ParseResult(scheme='', netloc='', path='www.other.domain', params='', query='', fragment='')
>>> if not urlparse(url).scheme:
...     print(urlparse('http://' + url))
...
ParseResult(scheme='http', netloc='www.other.domain', path='', params='', query='', fragment='')
so you always get the domain in the netloc attribute of the ParseResult.
Once you have the domain separated out, if you want to remove the 'www.', there are any number of simple ways to do it.
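For example, a simple prefix check (one of many options; the function name here is just illustrative):

>>> def strip_www(netloc):
...     # drop a leading 'www.' if it's there, otherwise return the domain unchanged
...     return netloc[4:] if netloc.startswith('www.') else netloc
...
>>> strip_www('www.some.domain')
'some.domain'
>>> strip_www('some.domain')
'some.domain'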