I am trying to create a regex filter to sanitize domains processed by a Python script. The domains could be plain domain names, could have a URL structure, or could have a URL structure with a www prefix. I currently have a crude regex that pulls domains out of these structures, but I have not figured out a way to filter out the leading www.
(?:[a-zA-Z0-9](?:[a-zA-Z0-9\-@]{,61}[a-zA-Z0-9])?\.)+[a-zA-Z]{2,11}
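For context, here's roughly how I'm applying it (simplified, illustrative inputs):

>>> import re
>>> domain_re = re.compile(r'(?:[a-zA-Z0-9](?:[a-zA-Z0-9\-@]{,61}[a-zA-Z0-9])?\.)+[a-zA-Z]{2,11}')
>>> domain_re.search('example.com').group(0)
'example.com'
>>> domain_re.search('http://example.com/some/path').group(0)
'example.com'
>>> domain_re.search('http://www.example.com/some/path').group(0)
'www.example.com'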
This regex does a decent job of grabbing domains out of URLs, but when I try to add any kind of negative lookahead to exclude the www., I can't seem to get the desired result. I've tried (?!www.), which only took away one w, not all three plus the dot. Any help figuring this out would be most appreciated.
Unless you absolutely have to use a regex, it's better to use something designed for this, like the built-in urlparse. For one thing, your regex (and the one linked in the comments) won't match domains with non-ASCII characters.
>>> from urlparse import urlparse # Python 2
>>> # from urllib.parse import urlparse # Python 3
>>> urlparse('http://www.some.domain/the/path')
ParseResult(scheme='http', netloc='www.some.domain', path='/the/path', params='', query='', fragment='')
>>> urlparse('http://www.some.domain/the/path').netloc
'www.some.domain'
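As a quick illustration of the non-ASCII point (made-up internationalized domain, using the Python 3 import):

>>> from urllib.parse import urlparse  # Python 3
>>> urlparse('http://www.bücher.example/path').netloc
'www.bücher.example'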
Note that you might want to detect strings without a scheme and add one:
>>> url = 'www.other.domain'
>>> urlparse(url)
ParseResult(scheme='', netloc='', path='www.other.domain', params='', query='', fragment='')
>>> if not urlparse(url).scheme:
...     print(urlparse('http://' + url))
...
ParseResult(scheme='http', netloc='www.other.domain', path='', params='', query='', fragment='')
so you always get the domain in the netloc attribute of the ParseResult.
Once you have the domain separated out, if you want to remove the 'www.', there are any number of simple ways to do it.
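For example, a simple prefix check (one of many options; the function name here is just illustrative):

>>> def strip_www(netloc):
...     # drop a leading 'www.' if it's there, otherwise return the domain unchanged
...     return netloc[4:] if netloc.startswith('www.') else netloc
...
>>> strip_www('www.some.domain')
'some.domain'
>>> strip_www('some.domain')
'some.domain'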