I want, in Python, normalize a URL. My main purpose is to add slash / at the end of the URL if it is not already present but only if the URL doesn't end in a slash already or a file extension (so images, .php ,files pages, etc. aren't affected).
For example, if it is http://www.example.com
then it should be converted to http://www.example.com/
. But if it is http://www.example.com/image.png
then it should not be affected.
To do this, I use this regular expression /([^/.]+)$
. Regex demo
But it doesn't work in this python code, start_url
is not modified
import re
start_url = "https://zonetuto.fr"
start_url = re.sub(r'/([^/.]+)$', r'/\1/', start_url)
print(start_url)
You could handle this by considering how a URL is constructed.
The part of the URL that follows the netloc is known as the path
For example:
https://www.example.com does not have a path
https://www.example.com/ has a path - i.e., just the /
https://www.example.com/banana has a path - i.e., /banana
Therefore you could utilise urllib.parse as follows:
from urllib.parse import urlparse
def normalise(url):
return url if urlparse(url).path else url + "/"
urls = [
"https://www.example.com",
"https://www.example.com/",
"https://www.example.com/banana"
]
for url in urls:
print(normalise(url))
Output:
https://www.example.com/
https://www.example.com/
https://www.example.com/banana