I'm parsing URLs like this:
>>> from urllib.parse import urlparse
>>> urlparse('http://foo.bar/path/to/heaven')
ParseResult(scheme='http', netloc='foo.bar', path='/path/to/heaven', params='', query='', fragment='')
Suppose I have a URL with a malformed path that contains repeated slashes, like this:
>>> x = urlparse('http://foo.bar/path/to/////foo///baz//bar')
>>> x
ParseResult(scheme='http', netloc='foo.bar', path='/path/to/////foo///baz//bar', params='', query='', fragment='')
As you can see, x.path still contains the repeated slashes. I'm trying to remove them, so I tried splitting, looping, and removing like this:
>>> newpath = x.path.split('/')
>>> newpath
['', 'path', 'to', '', '', '', '', 'foo', '', '', 'baz', '', 'bar']
>>> while '' in newpath:  # drop every empty segment
...     newpath.remove('')
...
>>> '/' + '/'.join(newpath)  # re-add the leading slash
'/path/to/foo/baz/bar'
This gives the desired output, but I think this solution is inefficient and ugly. How can I do it better?
This is what regular expressions are made for:
import regex as re  # third-party "regex" module (pip install regex), not the stdlib "re"
url = "http://foo.bar/path/to/////foo///baz//bar"
rx = re.compile(r'(?:(?:http|ftp)s?://)(*SKIP)(*FAIL)|/+')
url = rx.sub('/', url)
print(url)
This yields
http://foo.bar/path/to/foo/baz/bar
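See a demo on regex101.com. The only real problem is leaving the double forward slashes in the protocol as they are, hence the newer regex module and (*SKIP)(*FAIL): the first alternative matches the protocol prefix and then discards it, so only the remaining runs of slashes are replaced. You could achieve the same functionality with lookbehinds in the re module.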
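For reference, here is a rough sketch of what the stdlib-only variants might look like. The function names are made up for illustration. The first uses a negative lookbehind so that slashes directly after a colon (as in http://) are left alone. The second is an alternative that sidesteps the protocol issue by only touching the parsed path. Note that the lookbehind version assumes no ":" immediately precedes a run of slashes inside the path, and it will also collapse slashes in the query string or fragment, which may or may not be what you want.

```python
import re
from urllib.parse import urlparse


def collapse_slashes_lookbehind(url):
    # Replace runs of two or more slashes with a single slash,
    # unless the run directly follows a colon (e.g. the "//" in "http://").
    return re.sub(r'(?<!:)/{2,}', '/', url)


def collapse_slashes_in_path(url):
    # Parse first, then normalize only the path component,
    # so the scheme, netloc, query and fragment stay untouched.
    parts = urlparse(url)
    clean_path = re.sub(r'/{2,}', '/', parts.path)
    return parts._replace(path=clean_path).geturl()


url = "http://foo.bar/path/to/////foo///baz//bar"
print(collapse_slashes_lookbehind(url))
print(collapse_slashes_in_path(url))
```

Both calls should print http://foo.bar/path/to/foo/baz/bar for this input. If you already have the ParseResult from urlparse, the second variant is probably the safer choice, since it cannot touch anything outside the path.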